Abstract

A class of distortions termed functional Bregman divergences is defined, which
includes squared error and relative entropy. A functional Bregman divergence acts
on functions or distributions, and generalizes the standard Bregman divergence for
vectors and a previous pointwise Bregman divergence that was defined for functions.

A recently published result showed that the mean minimizes the expected Bregman 
divergence. The new functional definition enables the extension of this result to the 
continuous case to show that the mean minimizes the expected functional Bregman 
divergence over a set of functions or distributions. It is shown how this theorem 
applies to the Bayesian estimation of distributions. Estimation of the uniform 
distribution from independent and identically drawn samples is used as a case study. 



1. Overview 



Bregman divergences are a useful set of distortion functions that include squared 
error, relative entropy, logistic loss, Mahalanobis distance, and the Itakura-Saito 
function. Bregman divergences are popular in statistical estimation and information
theory. Analysis using the concept of Bregman divergences has played a key role in
recent advances in statistical learning [1-9], clustering [10,11], inverse problems [12], 
maximum entropy estimation [13], and the applicability of the data processing 
theorem [14]. Recently, it was discovered that the mean is the minimizer of the 
expected Bregman divergence for a set of d-dimensional points [10,15]. 

In this paper we define a functional Bregman divergence that applies to functions
and distributions, and we show that this new definition is equivalent to Bregman
divergence applied to vectors. The functional definition generalizes a pointwise
Bregman divergence that has been previously defined for measurable functions [7,16],
and thus extends the class of distortion functions that are Bregman divergences;
see Section 2.1.2 for an example. Most importantly, the functional definition
enables one to solve functional minimization problems using standard methods from
the calculus of variations; we extend the recent result on the expectation of vector
Bregman divergence [10,15] to show that the mean minimizes the expected Bregman
divergence for a set of functions or distributions. We show how this theorem
links to Bayesian estimation of distributions. For distributions from the
exponential family, many popular divergences, such as relative entropy, can
be expressed as a (different) Bregman divergence on the exponential distribution
parameters. The functional Bregman definition enables stronger results and a more
general application.



In Section 2 we state a functional definition of the Bregman divergence and
give examples for total squared difference, relative entropy, and squared bias. The
relationship between the functional definition and previous Bregman definitions
is established. In Section 3 we present the main theorem: that the expectation
of a set of functions minimizes the expected Bregman divergence. In Section 4 we
discuss the role of this theorem in Bayesian estimation, and as a case study compare
different estimates for the uniform distribution given independent and identically
drawn samples. For ease of reference, Appendix A contains relevant definitions and
results from functional analysis and the calculus of variations. In Appendix B we
show that the functional Bregman divergence has many of the same properties as
the standard vector Bregman divergence. Proofs are in Appendix C.

2. Functional Bregman Divergence 

Let $(\mathbb{R}^d, \Omega, \nu)$ be a measure space, where $\nu$ is a Borel measure, $d$ is a positive
integer, and define a set of functions $\mathcal{A} = \{a \in L^p(\nu) \text{ subject to } a : \mathbb{R}^d \to \mathbb{R},\ a \ge 0\}$
where $1 \le p \le \infty$.

Definition 2.1 (Functional Definition of Bregman Divergence). Let $\phi : L^p(\nu) \to \mathbb{R}$
be a strictly convex, twice-continuously Fréchet-differentiable functional. The
Bregman divergence $d_\phi : \mathcal{A} \times \mathcal{A} \to [0, \infty)$ is defined for all $f, g \in \mathcal{A}$ as

\[ d_\phi[f, g] = \phi[f] - \phi[g] - \delta\phi[g; f - g], \tag{1} \]

where $\delta\phi[g; \cdot]$ is the Fréchet derivative of $\phi$ at $g$.

Here, we have used the Fréchet derivative, but the definition (and results in
this paper) can be easily extended using more general definitions of derivatives; a
sample extension is given in Section 2.1.3.

The functional Bregman divergence has many of the same properties as the 
standard vector Bregman divergence, including non-negativity, convexity, linearity, 
equivalence classes, linear separation, dual divergences, and a generalized Pythagorean 
inequality. These properties are established in Appendix B. 

2.1. Examples. Different choices of the functional $\phi$ lead to different Bregman
divergences. Illustrative examples are given for squared error, squared bias, and
relative entropy. Functionals for other Bregman divergences can be derived based
on these examples, from the example functions for the discrete case given in Table
1 of [15], and from the fact that $\phi$ is a strictly convex functional if it has the
form $\phi(g) = \int \tilde\phi(g(t))\,dt$ where $\tilde\phi : \mathbb{R} \to \mathbb{R}$, $\tilde\phi$ is strictly convex and $g$ is in some
well-defined vector space of functions [17].

2.1.1. Total Squared Difference. Let $\phi[g] = \int g^2\,d\nu$, where $\phi : L^2(\nu) \to \mathbb{R}$, and let
$g, f, a \in L^2(\nu)$. Then

\[ \phi[g + a] - \phi[g] = \int (g + a)^2\,d\nu - \int g^2\,d\nu = 2\int g a\,d\nu + \int a^2\,d\nu. \]

Because

\[ \frac{\int a^2\,d\nu}{\|a\|_{L^2(\nu)}} = \frac{\|a\|^2_{L^2(\nu)}}{\|a\|_{L^2(\nu)}} = \|a\|_{L^2(\nu)} \to 0 \]

as $a \to 0$ in $L^2(\nu)$,

\[ \delta\phi[g; a] = 2\int g a\,d\nu, \]

which is a continuous linear functional in $a$. Then, by definition of the second
Fréchet derivative,

\[ \delta^2\phi[g; b, a] = \delta\phi[g + b; a] - \delta\phi[g; a] = 2\int (g + b) a\,d\nu - 2\int g a\,d\nu = 2\int b a\,d\nu. \]

Thus $\delta^2\phi[g; b, a]$ is a quadratic form, where $\delta^2\phi$ is actually independent of $g$ and
strongly positive since

\[ \delta^2\phi[g; a, a] = 2\int a^2\,d\nu = 2\|a\|^2_{L^2(\nu)} \]

for all $a \in L^2(\nu)$, which implies that $\phi$ is strictly convex and

\[ d_\phi[f, g] = \int f^2\,d\nu - \int g^2\,d\nu - 2\int g (f - g)\,d\nu = \int (f - g)^2\,d\nu = \|f - g\|^2_{L^2(\nu)}. \]
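The total squared difference example can be checked numerically on a grid. The following is a minimal sketch, assuming a Riemann-sum discretization of the integrals on [0, 1] with Lebesgue measure; the helper names (`integrate`, `bregman_total_squared`) and the two test functions are hypothetical choices made only for illustration.

```python
import math

def integrate(values, dx):
    """Riemann-sum approximation of an integral on a uniform grid."""
    return sum(values) * dx

def bregman_total_squared(f, g, dx):
    """d_phi[f, g] for phi[g] = int g^2 dv, using delta phi[g; a] = 2 int g a dv."""
    phi_f = integrate([fv * fv for fv in f], dx)
    phi_g = integrate([gv * gv for gv in g], dx)
    dphi = 2.0 * integrate([gv * (fv - gv) for fv, gv in zip(f, g)], dx)
    return phi_f - phi_g - dphi

# Hypothetical test functions on [0, 1], sampled at cell midpoints.
num_cells = 1000
dx = 1.0 / num_cells
xs = [(i + 0.5) * dx for i in range(num_cells)]
f = [math.exp(-x) for x in xs]
g = [1.0 + 0.5 * x for x in xs]

divergence = bregman_total_squared(f, g, dx)
l2_squared = integrate([(fv - gv) ** 2 for fv, gv in zip(f, g)], dx)
print(divergence, l2_squared)  # the two values agree up to rounding
```

Evaluating definition (1) term by term and evaluating the squared L2 distance directly give the same number on the grid, as the derivation predicts.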

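2.1.2. Squared Bias. Under definition (1), squared bias is a Bregman divergence;
we have not previously seen this noted in the literature despite the importance of
minimizing bias in estimation [18].

Let $\phi[g] = \left(\int g\,d\nu\right)^2$, where $\phi : L^1(\nu) \to \mathbb{R}$. In this case

\[ \phi[g + a] - \phi[g] = \left(\int g\,d\nu + \int a\,d\nu\right)^2 - \left(\int g\,d\nu\right)^2 = 2\int g\,d\nu \int a\,d\nu + \left(\int a\,d\nu\right)^2. \tag{2} \]

Note that $2\int g\,d\nu \int a\,d\nu$ is a continuous linear functional on $L^1(\nu)$ and
$\left(\int a\,d\nu\right)^2 \le \|a\|^2_{L^1(\nu)}$, so that

\[ 0 \le \frac{\left(\int a\,d\nu\right)^2}{\|a\|_{L^1(\nu)}} \le \frac{\|a\|^2_{L^1(\nu)}}{\|a\|_{L^1(\nu)}} = \|a\|_{L^1(\nu)}. \]

Thus from (2) and the definition of the Fréchet derivative,

\[ \delta\phi[g; a] = 2\int g\,d\nu \int a\,d\nu. \]

By the definition of the second Fréchet derivative,

\[ \delta^2\phi[g; b, a] = \delta\phi[g + b; a] - \delta\phi[g; a] = 2\int (g + b)\,d\nu \int a\,d\nu - 2\int g\,d\nu \int a\,d\nu = 2\int b\,d\nu \int a\,d\nu \]

is another quadratic form, and $\delta^2\phi$ is independent of $g$.
Because the functions in $\mathcal{A}$ are positive, $\delta^2\phi$ is strongly positive on $\mathcal{A}$ (which
again implies that $\phi$ is strictly convex):

\[ \delta^2\phi[g; a, a] = 2\left(\int a\,d\nu\right)^2 = 2\|a\|^2_{L^1(\nu)} \ge 0 \]

for $a \in \mathcal{A}$. The Bregman divergence is thus

\[
\begin{aligned}
d_\phi[f, g] &= \left(\int f\,d\nu\right)^2 - \left(\int g\,d\nu\right)^2 - 2\int g\,d\nu \int (f - g)\,d\nu \\
&= \left(\int f\,d\nu\right)^2 + \left(\int g\,d\nu\right)^2 - 2\int g\,d\nu \int f\,d\nu \\
&= \left(\int (f - g)\,d\nu\right)^2 \\
&\le \|f - g\|^2_{L^1(\nu)}.
\end{aligned}
\]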
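The squared bias divergence can be evaluated the same way. This is a small sketch under the assumption of a midpoint-rule discretization on [0, 1] with Lebesgue measure; the function names and the two test functions are hypothetical.

```python
def integrate(values, dx):
    """Riemann-sum approximation of an integral on a uniform grid."""
    return sum(values) * dx

def bregman_squared_bias(f, g, dx):
    """d_phi[f, g] for phi[g] = (int g dv)^2, using delta phi[g; a] = 2 int g dv int a dv."""
    total_f = integrate(f, dx)
    total_g = integrate(g, dx)
    return total_f ** 2 - total_g ** 2 - 2.0 * total_g * (total_f - total_g)

# Hypothetical non-negative test functions on [0, 1], sampled at cell midpoints.
num_cells = 1000
dx = 1.0 / num_cells
xs = [(i + 0.5) * dx for i in range(num_cells)]
f = [2.0 * x for x in xs]
g = [0.5 for _ in xs]

divergence = bregman_squared_bias(f, g, dx)
bias_squared = integrate([fv - gv for fv, gv in zip(f, g)], dx) ** 2
l1_squared = integrate([abs(fv - gv) for fv, gv in zip(f, g)], dx) ** 2
print(divergence, bias_squared, l1_squared)
```

For these inputs the divergence equals the squared integral of f - g, and it does not exceed the squared L1 norm of f - g, in line with the last two steps of the derivation.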

2.1.3. Relative Entropy of Simple Functions. Let $(X, \Sigma, \nu)$ be a measure space. We
denote by $\mathcal{S}$ the collection of all measurable simple functions on $(X, \Sigma, \nu)$, that is,
the set of functions which can be written as a finite linear combination of indicator
functions. If $g \in \mathcal{S}$ then it can be expressed as

\[ g(x) = \sum_{i=0}^{t} \alpha_i I_{T_i}; \quad \alpha_0 = 0, \]

where $I_{T_i}$ is the indicator function of the set $T_i$ and $\{T_i\}_{i=0}^{t}$ is a collection of
mutually disjoint measurable sets with the property that $X = \bigcup_{i=0}^{t} T_i$. We adopt
the convention that $T_0$ is the set on which $g$ is zero and therefore $\alpha_i \ne 0$ if $i \ne 0$.
The set $(\mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ is a normed vector space. In this case

\[ \int_X g \ln g\,d\nu = \sum_{i=1}^{t} \int_{T_i} \alpha_i \ln \alpha_i\,d\nu, \tag{3} \]

since $0 \ln 0 = 0$.

Note that the integral in (3) exists and is finite for $g \in \mathcal{S}$ if $g \in L^1(\nu)$ and $g \ge 0$.
This implies that $\nu(T_i) < \infty$ for all $1 \le i \le t$, while the measure of $T_0$ could be
infinity. For this reason, consider the normed vector space $(L^1(\nu) \cap \mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$,
where $(L^1(\nu) \cap \mathcal{S}) \subset \mathcal{S} \subset L^\infty(\nu)$. Let $\mathcal{W}$ be the set (not necessarily a vector space)
of functions satisfying the conditions mentioned above; that is, let

\[ \mathcal{W} = \{g \in L^1(\nu) \cap \mathcal{S} \text{ subject to } g \ge 0\}. \]

Define the functional $\phi$ on $\mathcal{W}$,

\[ \phi[g] = \int_X g \ln g\,d\nu, \quad g \in \mathcal{W}. \tag{4} \]

The functional $\phi$ is not Fréchet-differentiable at $g$ because in general it cannot be
guaranteed that $g + h$ is non-negative for all functions $h$ in the underlying normed
vector space $(L^1(\nu) \cap \mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ with norm smaller than any prescribed $\epsilon > 0$.
However, a generalized Gâteaux derivative can be defined if we limit the perturbing
function $h$ to a vector subspace.

Let $\mathcal{G}$ be the subspace of $(L^1(\nu) \cap \mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ defined by

\[ \mathcal{G} = \{f \in L^1(\nu) \cap \mathcal{S} \text{ subject to } f\,d\nu \ll g\,d\nu\}. \]

It is straightforward to show that $\mathcal{G}$ is a vector space. We define the generalized
Gâteaux derivative of $\phi$ at $g \in \mathcal{W}$ to be the linear operator $\delta_G\phi[g; \cdot]$ if

\[ \lim_{\substack{\|h\|_{L^\infty(\nu)} \to 0 \\ h \in \mathcal{G}}} \frac{\left|\phi[g + h] - \phi[g] - \delta_G\phi[g; h]\right|}{\|h\|_{L^\infty(\nu)}} = 0. \tag{5} \]

Note that $\delta_G\phi[g; \cdot]$ is not linear in general, but it is on the vector space $\mathcal{G}$. In
general, if $\mathcal{G}$ is the entire underlying vector space then (5) is the Fréchet derivative,
and if $\mathcal{G}$ is the span of only one element from the underlying vector space then (5)
is the Gâteaux derivative. Here, we have generalized the Gâteaux derivative for the
present case that $\mathcal{G}$ is a subspace of the underlying vector space.

It remains to be shown that given the functional (4), the derivative (5) exists
and yields relative entropy. Consider the solution

\[ \delta_G\phi[g; h] = \int_X (1 + \ln g)\,h\,d\nu, \tag{6} \]

which coupled with (4) does yield relative entropy. We complete the proof by
showing that (6) satisfies (5). Note that

\[
\begin{aligned}
\phi[g + h] - \phi[g] - \delta_G\phi[g; h] &= \int_X \left[(h + g) \ln\frac{h + g}{g} - h\right] d\nu \\
&= \int_E \left[(h + g) \ln\frac{h + g}{g} - h\right] d\nu,
\end{aligned}
\tag{7}
\]

where $E$ is the set on which $g$ is not zero.
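For simple functions the integrals reduce to finite sums, so the claim that (4) and (6) together yield relative entropy can be checked directly. The sketch below assumes both functions are constant on a common finite partition with cell measures `measures`; all names and numerical values are hypothetical.

```python
import math

def xlogx(x):
    """x ln x with the convention 0 ln 0 = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)

def phi(values, measures):
    """phi[g] = int g ln g dv for a simple function given by its cell values."""
    return sum(xlogx(v) * m for v, m in zip(values, measures))

def dphi(g, h, measures):
    """The candidate derivative (6), int (1 + ln g) h dv, over cells where g > 0."""
    return sum((1.0 + math.log(gv)) * hv * m
               for gv, hv, m in zip(g, h, measures) if gv > 0.0)

def bregman_entropy(f, g, measures):
    """Definition (1) evaluated with phi from (4) and the derivative from (6)."""
    h = [fv - gv for fv, gv in zip(f, g)]
    return phi(f, measures) - phi(g, measures) - dphi(g, h, measures)

# Hypothetical partition (cell measures) and cell values of f and g.
measures = [0.5, 1.5, 2.0]
f = [0.2, 0.4, 0.05]
g = [0.4, 0.2, 0.25]

divergence = bregman_entropy(f, g, measures)
relative_entropy = sum((fv * math.log(fv / gv) - fv + gv) * m
                       for fv, gv, m in zip(f, g, measures))
print(divergence, relative_entropy)
```

The second quantity is the relative entropy in its form for functions that need not integrate to one, which is what the definition produces here.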

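Because $g \in \mathcal{W}$, there are $m, M > 0$ such that $m \le g \le M$ on $E$. Let $h \in \mathcal{G}$ be
such that $\|h\|_{L^\infty(\nu)} \le m$, then $g + h \ge 0$. Our goal is to find a lower and an upper
bound for the expression

\[ \frac{\phi[g + h] - \phi[g] - \delta_G\phi[g; h]}{\|h\|_{L^\infty(\nu)}} \]

such that both bounds go to 0 as $\|h\|_{L^\infty(\nu)} \to 0$. We start with bounding the
integrand from above:

\[ (h + g) \ln\frac{h + g}{g} - h \le (h + g)\frac{h}{g} - h = \frac{h^2}{g}, \]

and therefore

\[
\begin{aligned}
\frac{\phi[g + h] - \phi[g] - \delta_G\phi[g; h]}{\|h\|_{L^\infty(\nu)}} &\le \frac{1}{\|h\|_{L^\infty(\nu)}} \int_E \frac{h^2}{g}\,d\nu \\
&\le \frac{1}{m} \int_E |h|\,d\nu \\
&\le \frac{1}{m} \|h\|_{L^1(\nu)}.
\end{aligned}
\]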

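We can use Jensen's inequality to find a lower bound for the integral (7). In
order to use the inequality we have to rewrite the equation. We begin with the first
term of the integrand,

\[
\begin{aligned}
\int_E (h + g) \ln\frac{h + g}{g}\,d\nu &= \int_E \frac{h + g}{g} \ln\left(\frac{h + g}{g}\right) g\,d\nu \\
&= \|g\|_{L^1(\nu)} \int_E \frac{h + g}{g} \ln\left(\frac{h + g}{g}\right) \frac{g}{\|g\|_{L^1(\nu)}}\,d\nu \\
&= \|g\|_{L^1(\nu)} \int_E \lambda\left(\frac{h + g}{g}\right) d\tilde\nu,
\end{aligned}
\]

where the measure $d\tilde\nu = \frac{g}{\|g\|_{L^1(\nu)}}\,d\nu$ is a probability measure and $\lambda(x) = x \ln x$ is a
convex function on $(0, \infty)$. Let $M_0 = \|g\|_{L^1(\nu)}$. By Jensen's inequality

\[
\begin{aligned}
M_0 \int_E \lambda\left(\frac{h + g}{g}\right) d\tilde\nu &\ge M_0\,\lambda\left(\int_E \frac{h}{M_0}\,d\nu + \int_E d\tilde\nu\right) \\
&= M_0\,\lambda\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) \\
&= \left(\int_E h\,d\nu + M_0\right) \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right).
\end{aligned}
\]

Thus we can bound the integral in (7) from below:

\[
\begin{aligned}
\int_E \left[(h + g) \ln\frac{h + g}{g} - h\right] d\nu &\ge \left(\int_E h\,d\nu + M_0\right) \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) - \int_E h\,d\nu \\
&= \int_E h\,d\nu\, \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) + M_0 \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) - \int_E h\,d\nu.
\end{aligned}
\]

If $\int_E h\,d\nu = 0$, then the integral in (7) is non-negative. The more interesting case
is when $\int_E h\,d\nu \ne 0$. Then,

\[
\begin{aligned}
\frac{\phi[g + h] - \phi[g] - \delta_G\phi[g; h]}{\|h\|_{L^\infty(\nu)}}
&\ge \frac{\int_E h\,d\nu}{\|h\|_{L^\infty(\nu)}} \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) + \frac{M_0 \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) - \int_E h\,d\nu}{\|h\|_{L^\infty(\nu)}} \\
&= \frac{\int_E h\,d\nu}{\|h\|_{L^\infty(\nu)}} \left[\ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) + \frac{M_0 \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) - \int_E h\,d\nu}{\int_E h\,d\nu}\right].
\end{aligned}
\]

As $\int_E h\,d\nu \to 0$,

\[ \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) \to 0 \]

and

\[ \frac{M_0 \ln\left(\frac{1}{M_0}\int_E h\,d\nu + 1\right) - \int_E h\,d\nu}{\int_E h\,d\nu} \to 0. \]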



We finish the proof by showing that there is a constant $K$ which is independent of
$h$ such that

\[ \left|\int_E h\,d\nu\right| \le \|h\|_{L^1(\nu)} \le K \|h\|_{L^\infty(\nu)}. \tag{8} \]

If (8) is shown, then $\int_E h\,d\nu \to 0$ as $\|h\|_{L^\infty(\nu)} \to 0$, and coupling
this with the two limits above and the fact that

\[ \frac{\left|\int_E h\,d\nu\right|}{\|h\|_{L^\infty(\nu)}} \le K \]

establishes (5). Because $h \in \mathcal{G}$, $h$ can be expressed as

\[ h = \sum_{i=0}^{v} \beta_i I_{V_i}; \quad \beta_0 = 0, \]

where $\{V_i\}_{i=0}^{v}$ is a collection of mutually disjoint measurable sets with the property
that $X = \bigcup_{i=0}^{v} V_i$. Also, because $h\,d\nu \ll g\,d\nu$, there is a set $N(h)$ such that
$\nu(N(h)) = 0$ and

\[ \bigcup_{i=1}^{v} V_i \subset \left(\bigcup_{i=1}^{t} T_i \cup N(h)\right). \]

This implies that there is a $K$ independent of $h$ such that

\[ \sum_{i=1}^{v} \nu(V_i) \le \sum_{i=1}^{t} \nu(T_i) = K. \]

Finally,

\[ \int_X |h|\,d\nu = \sum_{i=1}^{v} |\beta_i|\,\nu(V_i) \le \|h\|_{L^\infty(\nu)} \sum_{i=1}^{v} \nu(V_i) \le \|h\|_{L^\infty(\nu)} K. \]



2.2. Relationship to Other Bregman Divergence Definitions. Two propositions
establish the relationship of the functional Bregman divergence to other
Bregman divergence definitions.

Proposition 2.2 (Functional Bregman Divergence Generalizes Vector Bregman
Divergence). The functional definition (1) is a generalization of the standard vector
Bregman divergence

\[ d_{\tilde\phi}(x, y) = \tilde\phi(x) - \tilde\phi(y) - \nabla\tilde\phi(y)^T (x - y), \tag{9} \]

where $x, y \in \mathbb{R}^n$, and $\tilde\phi : \mathbb{R}^n \to \mathbb{R}$ is strictly convex and twice differentiable.

Jones and Byrne describe a general class of divergences between functions using a
pointwise formulation [7]. Csiszár specialized the pointwise formulation to a class of
divergences he termed Bregman distances $B_{s,\nu}$ [16], where given a $\sigma$-finite measure
space $(X, \Omega, \nu)$ and non-negative measurable functions $f(x)$ and $g(x)$, $B_{s,\nu}(f, g)$
equals

\[ \int \left[ s(f(x)) - s(g(x)) - s'(g(x)) (f(x) - g(x)) \right] d\nu(x). \tag{10} \]

The function $s : (0, \infty) \to \mathbb{R}$ is constrained to be differentiable and strictly convex,
and the limits $\lim_{x \to 0} s(x)$ and $\lim_{x \to 0} s'(x)$ must exist, but need not be finite.
The function $s$ plays a role similar to the functional $\phi$ in the functional Bregman
divergence; however, $s$ acts on the range of the functions $f, g$, whereas $\phi$ acts on
the pair of functions $f, g$.

Proposition 2.3 (Functional Definition Generalizes Pointwise Definition). Given
a pointwise Bregman divergence as per (10), an equivalent functional Bregman
divergence can be defined as per (1) if the measure $\nu$ is finite. However, given a
functional Bregman divergence $d_\phi[f, g]$, there is not necessarily an equivalent
pointwise Bregman divergence.

3. Minimum Expected Bregman Divergence

Consider two sets of functions (or distributions), $\mathcal{M}$ and $\mathcal{A}$. Let $F \in \mathcal{M}$ be a
random function with realization $f$. Suppose there exists a probability distribution
$P_F$ over the set $\mathcal{M}$, such that $P_F(f)$ is the probability of $f \in \mathcal{M}$. For example,
consider the set of Gaussian distributions, and given samples drawn independently
and identically from a randomly selected Gaussian distribution $\mathcal{N}$, the data imply a
posterior probability $P_{\mathcal{N}}(\mathcal{N})$ for each possible generating realization of a Gaussian
distribution $\mathcal{N}$. The goal is to find the function $g^* \in \mathcal{A}$ that minimizes the expected
Bregman divergence between the random function $F$ and any function $g \in \mathcal{A}$. The
following theorem shows that if the set of possible minimizers $\mathcal{A}$ includes $E_{P_F}[F]$,
then $g^* = E_{P_F}[F]$ minimizes the expectation of any Bregman divergence.

The theorem applies only to a set of functions $\mathcal{M}$ that lie on a finite-dimensional
manifold $M$ for which a differential element $dM$ can be defined. For example,
the set $\mathcal{M}$ could be parameterized by a finite number of parameters, or could be
a set of functions that can be decomposed into a finite set of $d$ basis functions
$\{\psi_1, \psi_2, \ldots, \psi_d\}$ such that each $f$ can be expressed as

\[ f = \sum_{j=1}^{d} c_j \psi_j, \]

where $c_j \in \mathbb{R}$ for all $j$. The theorem requires slightly stronger conditions on $\phi$ than
the definition of the Bregman divergence (1) requires.

Theorem 3.1 (Minimizer of the Expected Bregman Divergence). Let $\delta^2\phi[f; a, a]$
be a strongly positive quadratic form, and let $\phi \in C^3(L^1(\nu); \mathbb{R})$ be a three-times
continuously Fréchet-differentiable functional on $L^1(\nu)$. Let $\mathcal{M}$ be a set of functions
that lie on a finite-dimensional manifold $M$, and have associated differential element
$dM$. Suppose there is a probability distribution $P_F$ defined over the set $\mathcal{M}$. Suppose
the function $g^*$ minimizes the expected Bregman divergence between the random
function $F$ and any function $g \in \mathcal{A}$ such that

\[ g^* = \arg\inf_{g \in \mathcal{A}} E_{P_F}[d_\phi(F, g)]. \]

Then, if $g^*$ exists, it is given by

\[ g^* = \int_M f\,P_F(f)\,dM = E_{P_F}[F]. \tag{11} \]
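The statement of the theorem can be illustrated on a small discretized example. The sketch below is an illustration rather than a proof: it assumes a finite set of three positive functions on a four-cell grid, a hypothetical probability for each, and the Bregman divergence generated by the integral of g ln g. It compares the expected divergence at the mean function with the expected divergence at several other candidates.

```python
import math

def gen_kl(f, g, dx):
    """Discretized Bregman divergence generated by phi[g] = int g ln g dv."""
    return sum((fv * math.log(fv / gv) - fv + gv) * dx for fv, gv in zip(f, g))

def expected_divergence(functions, probs, g, dx):
    """E[d_phi(F, g)] when F takes the value functions[k] with probability probs[k]."""
    return sum(p * gen_kl(f, g, dx) for f, p in zip(functions, probs))

# Hypothetical realizations of the random function F on a 4-cell grid over [0, 1].
dx = 0.25
functions = [
    [1.6, 1.2, 0.8, 0.4],
    [1.0, 1.0, 1.0, 1.0],
    [0.4, 0.8, 1.2, 1.6],
]
probs = [0.5, 0.3, 0.2]

# Pointwise mean of the random function.
mean = [sum(p * f[i] for f, p in zip(functions, probs)) for i in range(4)]

# Other candidate minimizers: the realizations themselves and two perturbed means.
candidates = functions + [
    [0.9 * m for m in mean],
    [m + 0.1 * (i - 1.5) for i, m in enumerate(mean)],
]

risk_at_mean = expected_divergence(functions, probs, mean, dx)
risks = [expected_divergence(functions, probs, c, dx) for c in candidates]
print(risk_at_mean, risks)
```

In this discrete setting the gap between the risk at any candidate and the risk at the mean equals the divergence from the mean to that candidate, so the gap is never negative.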

4. Bayesian Estimation 

Theorem 3.1 can be applied to a set of distributions to find the Bayesian estimate
of a distribution given a posterior or likelihood. For parametric distributions
parameterized by $\theta \in \mathbb{R}^n$, a probability measure $\Lambda(\theta)$, and some risk function $R(\theta, \psi)$,
$\psi \in \mathbb{R}^n$, the Bayes estimator is defined [19] as

\[ \hat\theta = \arg\inf_{\psi \in \mathbb{R}^n} \int R(\theta, \psi)\,d\Lambda(\theta). \tag{12} \]

That is, the Bayes estimator minimizes some expected risk in terms of the
parameters. It follows from recent results [15] that $\hat\theta = E[\Theta]$ if the risk $R$ is a Bregman
divergence, where $\Theta$ is the random variable whose realization is $\theta$.

The principle of Bayesian estimation can be applied to the distributions
themselves rather than to the parameters:

\[ \hat g = \arg\inf_{g \in \mathcal{A}} \int_M R(f, g)\,P_F(f)\,dM, \tag{13} \]

where $P_F(f)$ is a probability measure on the distributions $f \in \mathcal{M}$, $dM$ is a
differential element for the finite-dimensional manifold $M$, and $\mathcal{A}$ is either the space of
all distributions or a subset of the space of all distributions, such as the set $\mathcal{M}$.
When the set $\mathcal{A}$ includes the distribution $E_{P_F}[F]$ and the risk function $R$ in (13)
is a Bregman divergence, then Theorem 3.1 establishes that $\hat g = E_{P_F}[F]$.

For example, in recent work, two of the authors derived the mean class posterior
distribution for each class for a Bayesian quadratic discriminant analysis classifier
[6], and showed that the classification results were superior to parameter-based
Bayesian quadratic discriminant analysis.

Of particular interest for estimation problems are the Bregman divergence
examples given in Section 2.1: total squared difference (mean squared error) is a
popular risk function in regression [18]; minimizing relative entropy leads to
useful theorems for large deviations and other statistical subfields [20]; and analyzing
bias is a common approach to characterizing and understanding statistical learning
algorithms [18].

4.1. Case Study: Estimating a Scaled Uniform Distribution. As an
illustration, we present and compare different estimates of a scaled uniform distribution
given independent and identically drawn samples. Let the set of uniform
distributions over $[0, \theta]$ for $\theta \in \mathbb{R}^+$ be denoted by $\mathcal{U}$. Given independent and identically
distributed samples $X_1, X_2, \ldots, X_n$ drawn from an unknown uniform distribution
$f \in \mathcal{U}$, the generating distribution is to be estimated. The risk function $R$ is taken
to be squared error or total squared error depending on context.

4.1.1. Bayesian Parameter Estimate. Depending on the choice of the probability
measure $\Lambda(\theta)$, the integral (12) may not be finite; for example, using the likelihood
of $\theta$ with Lebesgue measure the integral is not finite. A standard solution is to use
a gamma prior on $\theta$ and Lebesgue measure. Let $\Theta$ be a random parameter with
realization $\theta$, let the gamma distribution have parameters $t_1$ and $t_2$, and denote the
maximum of the data as $X_{\max} = \max\{X_1, X_2, \ldots, X_n\}$. Then a Bayesian estimate
is formulated [19, p. 240, 285]:

\[ E[\Theta \mid \{X_1, X_2, \ldots, X_n\}, t_1, t_2] = \frac{\int_{X_{\max}}^{\infty} \theta\,\frac{1}{\theta^{n + t_1 + 1}}\,e^{-\frac{1}{\theta t_2}}\,d\theta}{\int_{X_{\max}}^{\infty} \frac{1}{\theta^{n + t_1 + 1}}\,e^{-\frac{1}{\theta t_2}}\,d\theta}. \tag{14} \]

The integrals can be expressed in terms of the chi-squared random variable $\chi^2_v$ with
$v$ degrees of freedom:

\[ E[\Theta \mid \{X_1, X_2, \ldots, X_n\}, t_1, t_2] = \frac{1}{t_2 (n + t_1 - 1)}\,\frac{P\left(\chi^2_{2(n + t_1 - 1)} < \frac{2}{t_2 X_{\max}}\right)}{P\left(\chi^2_{2(n + t_1)} < \frac{2}{t_2 X_{\max}}\right)}. \tag{15} \]

Note that (12) presupposes that the best solution is also a uniform distribution.
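The ratio of chi-squared probabilities in (15) involves two very small numbers when n is large, so a direct evaluation can lose both to underflow. The following is a minimal sketch, not the procedure used in the paper, that evaluates the ratio in the log domain. It assumes the standard identity that the chi-squared distribution function with 2k degrees of freedom at x equals the regularized lower incomplete gamma function P(k, x/2), and it computes P(k, y) with its power series. The function names are hypothetical.

```python
import math

def log_lower_reg_gamma(k, y):
    """log P(k, y) via the series gamma(k, y) = y^k e^{-y} sum_i y^i / (k (k+1) ... (k+i))."""
    term = 1.0 / k
    total = term
    i = 1
    while term > 1e-17 * total:
        term *= y / (k + i)
        total += term
        i += 1
    return k * math.log(y) - y - math.lgamma(k) + math.log(total)

def bayes_parameter_estimate(n, x_max, t1, t2):
    """Evaluate (15); requires n + t1 - 1 > 0. Only the log of each probability is formed."""
    y = 1.0 / (t2 * x_max)  # chi-squared argument 2 / (t2 * x_max), halved
    log_ratio = (log_lower_reg_gamma(n + t1 - 1, y)
                 - log_lower_reg_gamma(n + t1, y))
    return math.exp(log_ratio) / (t2 * (n + t1 - 1))

# Hypothetical inputs: a small and a large sample size.
small = bayes_parameter_estimate(n=1, x_max=0.5, t1=1.0, t2=1.0)
large = bayes_parameter_estimate(n=200, x_max=0.99, t1=1.0, t2=1.0)
print(small, large)
```

For n = 200 each probability is far below the smallest positive double, yet the difference of the logarithms stays moderate, so the estimate remains finite and slightly above the sample maximum.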



4.1.2. Bayesian Uniform Distribution Estimate. If one restricts the minimizer of
(13) to be a uniform distribution, then (13) is solved with $\mathcal{A} = \mathcal{U}$. Because the
set of uniform distributions does not generally include its mean, Theorem 3.1 does
not apply, and thus different Bregman divergences may give different minimizers
for (13). Let $P_F$ be the likelihood of the data (no prior is assumed over the set $\mathcal{U}$),
and use the Fisher information metric ([21-23]) for $dM$. Then the solution to (13)
is the uniform distribution on $[0, 2^{1/n} X_{\max}]$. Using Lebesgue measure instead gives
a similar result: $[0, 2^{1/(n + 1/2)} X_{\max}]$. We were unable to find these estimates in the
literature, and so their derivations are presented in Appendix C.

4.1.3. Unrestricted Bayesian Distribution Estimate. When the only restriction placed
on the minimizer $g$ in (13) is that $g$ be a distribution, then one can apply
Theorem 3.1 and solve directly for the expected distribution $E_{P_F}[F]$. Let $P_F$ be the
likelihood of the data (no prior is assumed over the set $\mathcal{U}$), and use the Fisher
information metric for $dM$. Solving (11), noting that the uniform probability of $x$
is $f(x) = 1/a$ if $x \le a$ and zero otherwise, and the likelihood of the $n$ drawn points
is $(1/a)^n$ if $a \ge X_{\max}$ and zero otherwise,

\[ g^*(x) = \frac{\int_{\max(x, X_{\max})}^{\infty} \left(\frac{1}{a}\right) \left(\frac{1}{a^n}\right) \left(\frac{da}{a}\right)}{\int_{X_{\max}}^{\infty} \left(\frac{1}{a^n}\right) \left(\frac{da}{a}\right)} = \frac{n\,X_{\max}^n}{(n + 1) [\max(x, X_{\max})]^{n + 1}}. \tag{16} \]
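A quick numerical check of (16) is that the estimate is a proper density. The sketch below assumes a midpoint-rule integration truncated at a hypothetical upper limit of 50 times the sample maximum; the function names and the sample values are illustrative only.

```python
def unrestricted_estimate(x, x_max, n):
    """g*(x) from (16) for x >= 0: flat up to x_max, then a polynomial tail."""
    m = max(x, x_max)
    return n * x_max ** n / ((n + 1) * m ** (n + 1))

def total_mass(x_max, n, upper_factor=50.0, num_cells=200000):
    """Midpoint-rule integral of g* over [0, upper_factor * x_max]."""
    upper = upper_factor * x_max
    dx = upper / num_cells
    return sum(unrestricted_estimate((i + 0.5) * dx, x_max, n)
               for i in range(num_cells)) * dx

# Hypothetical sample maximum and sample size.
x_max, n = 0.9, 5
mass = total_mass(x_max, n)
flat_mass = x_max * unrestricted_estimate(0.0, x_max, n)
print(mass, flat_mass)
```

The flat part on [0, X_max] carries mass n/(n+1) and the tail carries the remaining 1/(n+1), so the estimate places some probability beyond the largest observation.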



4.1.4. Projecting the Unrestricted Estimate onto the Set of Uniform Distributions.
Consider what happens when the unrestricted solution $g^*(x)$ given in (16) is
projected onto the set of uniform distributions with respect to squared error. That is,
we solve for the uniform distribution $h(x)$ over $[0, \hat a]$ such that:

\[ \hat a = \arg\min_{a \in [0, \infty)} \int_0^{\infty} \left(h(x) - g^*(x)\right)^2 dx. \tag{17} \]

The problem is straightforward to solve using standard calculus and yields the
solution $\hat a = 2^{1/n} X_{\max}$. This is also the solution to the problem (13) when the
minimizer is restricted to be a uniform distribution and the Fisher information metric
over the uniform distributions is used (as discussed in Section 4.1.2). Thus, the
projection of the unrestricted solution to (13) onto the set of uniform distributions
is the same as the solution to (13) when the minimizer is restricted to be uniform.
We conjecture that under some conditions this property will hold more generally:
that the projection of the unrestricted minimizer of (13) onto the set $\mathcal{M}$ will be
equivalent to solving (13) where the solution is restricted to the set $\mathcal{M}$.
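The projection (17) can be checked with a simple grid search. The sketch below relies on two steps that are not spelled out in the text and are therefore assumptions of this example: the squared error between the uniform density on [0, a] and g* equals 1/a minus (2/a) times the integral of g* over [0, a], plus a constant that does not depend on a; and that integral is obtained in closed form by integrating (16). The function names and sample values are hypothetical.

```python
def cdf_unrestricted(a, x_max, n):
    """Integral of g* from 0 to a, obtained by integrating (16) piecewise."""
    if a <= x_max:
        return n * a / ((n + 1) * x_max)
    return 1.0 - (x_max / a) ** n / (n + 1)

def projection_objective(a, x_max, n):
    """Squared error in (17) for the uniform density on [0, a], minus the constant int g*^2."""
    return 1.0 / a - 2.0 * cdf_unrestricted(a, x_max, n) / a

# Hypothetical sample maximum and sample size.
x_max, n = 0.8, 5

# Grid of candidate endpoints between 0.5 * x_max and 3 * x_max.
grid = [x_max * (0.5 + 1e-4 * i) for i in range(25001)]
best_a = min(grid, key=lambda a: projection_objective(a, x_max, n))
closed_form = 2.0 ** (1.0 / n) * x_max
print(best_a, closed_form)
```

The grid minimizer agrees with the stated solution to within the grid spacing.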

4.2. Simulation. A simulation was done to compare the different Bayesian
estimators and the maximum likelihood estimator. The simulation was run 1,000 times;
each time $n$ data points were drawn independently and identically from the
uniform over [0, 1], and estimates were formed. Figure 1 is a log-log plot of the average
squared errors between the estimated distribution and the true distribution.

For the Bayesian parameter estimator given in (15), estimates were calculated
for three different sets of Gamma parameters, $(t_1 = 1, t_2 = 1)$, $(t_1 = 1, t_2 = 3)$,
and $(t_1 = 1, t_2 = 100)$. The plotted error is the minimum of the three averaged
errors for the different Gamma priors for each $n$. The plotted Bayesian distribution
estimates used the Fisher information metric (very similar simulation results were
obtained with the Lebesgue measure).

Given more than one random sample from the uniform, the unrestricted Bayesian
distribution estimator (thick line) always performed better than the other
estimators (as it should by design). Of course, asymptotically as $n \to \infty$, all of the
estimates will converge to the true value. For $n = 1$, the Bayesian parameter
estimate performs better; we believe this is due to the (in this case correct) bias of the
prior used for the Bayesian parameter estimate. The dotted line rises at $n = 155$
because the Bayesian parameter estimate was uncomputable for more than 155
data samples (we used Matlab v. 14 to evaluate (15), and for 155 data samples or
more the numerator and denominator of (15) were determined to be 0, leading to
an indeterminate estimate).
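A reduced version of this comparison can be sketched for two of the estimators. The example below is not the simulation reported here. It assumes the error is the integrated squared difference between the estimated density and the uniform density on [0, 1], and it uses closed forms for that error that were derived for this sketch from the maximum likelihood estimate (uniform on [0, X_max]) and from (16). The seed, run count, and function names are hypothetical.

```python
import random

def ise_ml(m):
    """Integrated squared error of the uniform density on [0, m] against uniform [0, 1]."""
    return (1.0 - m) / m

def ise_unrestricted(m, n):
    """Integrated squared error of g* from (16) against uniform [0, 1], for m <= 1."""
    return (2.0 * n * n / ((n + 1) * (2 * n + 1) * m)
            - 1.0 + 2.0 * m ** n / (n + 1))

def simulate(n, runs, seed=0):
    """Average both errors over repeated draws of n samples from uniform [0, 1]."""
    rng = random.Random(seed)
    total_ml = 0.0
    total_bayes = 0.0
    for _ in range(runs):
        m = max(rng.random() for _ in range(n))  # sample maximum X_max
        total_ml += ise_ml(m)
        total_bayes += ise_unrestricted(m, n)
    return total_ml / runs, total_bayes / runs

mean_ml, mean_bayes = simulate(n=5, runs=2000)
print(mean_ml, mean_bayes)
```

Under these assumptions the average error of the unrestricted estimate comes out below that of the maximum likelihood estimate for n = 5, which matches the ordering described for the thick and dashed lines.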

Three interesting conclusions are supported by the simulation results. First, the
Bayesian estimates do improve significantly over the maximum likelihood estimate
(dashed line). Second, although the truth is uniform, the unrestricted Bayesian
distribution estimate chooses a non-uniform solution (thick line), which does
significantly better than either of the Bayesian uniform estimates (thin line and dotted
line). Third, the Bayesian parameter estimate (dotted line) and the Bayesian
uniform distribution estimate (thin line) perform quite similarly. For n < 10, the
Bayesian parameter estimate works better, but for n > 10, the Bayesian uniform
distribution estimate is slightly better. Although these two estimates perform
similarly, the Bayesian uniform distribution estimate $[0, 2^{1/n} X_{\max}]$ is a more elegant
solution than the parameter estimate (15), and is easier to compute and to work
with analytically.

Figure 1. The plot shows the log of the squared error between an
estimated distribution and a uniform [0, 1] distribution, averaged
over one thousand runs of the estimation simulation. The dashed
line is the maximum likelihood estimate, the dotted line is the
Bayesian parameter estimate, the thick solid line is the Bayesian
distribution estimate that solves (13), and the thin solid line is the
Bayesian distribution estimate that solves (13) when the minimizer
is restricted to be uniform. The horizontal axis is the number of data samples.

5. Further Discussion and Open Questions 

We have defined a general Bregman divergence for functions and distributions
that can provide a foundation for results in statistics, information theory and signal
processing. Theorem 3.1 is important for these fields because it ties Bregman
divergences to expectation. As shown in Section 4, Theorem 3.1 can be directly
applied to distributions to show that Bayesian distribution estimation simplifies to
expectation when the risk function is a Bregman divergence and the minimizing
distribution is unrestricted.

It is common in Bayesian estimation to interpret the prior as representing some
actual prior knowledge, but in fact prior knowledge often is not available or is
difficult to quantify. Another approach is to use a prior to capture coarse information
from the data that may be used to stabilize the estimation [6,9]. In practice, priors
are sometimes chosen in Bayesian estimation to tame the tail of likelihood
distributions so that expectations will exist when they might otherwise be infinite [19].
This mathematically convenient use of priors adds estimation bias that may be
unwarranted by prior knowledge. An alternative to mathematically convenient priors
is to formulate the estimation problem as a minimization of an expected Bregman
divergence between the unknown distribution and the estimated distribution, and
restrict the set of distributions that can be the minimizer to be a set for which
the Bayesian integrals exist. Open questions are how such restrictions affect the
estimation bias and variance, and how to find or define a "best" restricted set of
distributions for this estimation approach.

Finally, there are some results for the standard vector Bregman divergence that 
have not been extended here. It has been shown that a standard vector Bregman 
divergence must be the risk function in order for the mean to be the minimizer of 
an expected risk [15, Theorems 3 and 4]. The proof of that result relies heavily on 
the discrete nature of the underlying vectors, and it remains an open question as 
to whether a similar result holds for the functional Bregman divergence. Another 
result that has been shown for the vector case but remains an open question in the 
functional case is convergence in probability [15, Theorem 2].

Appendix A: Relevant Definitions and Results from Functional Analysis

This appendix explains the basic definitions and results from functional analysis 
used in this paper. This material can also be found in standard books on the 
calculus of variations, such as the text by Gelfand and Fomin [24]. 

Let $(\mathbb{R}^d, \Omega, \nu)$ be a measure space, where $\nu$ is a Borel measure, $d$ is a positive
integer, and define a set of functions $\mathcal{A} = \{a \in L^p(\nu) \text{ subject to } a : \mathbb{R}^d \to \mathbb{R},\ a \ge 0\}$
where $1 \le p \le \infty$. The subset $\mathcal{A}$ is a convex subset of $L^p(\nu)$ because for $a_1, a_2 \in \mathcal{A}$
and $0 \le \omega \le 1$, $\omega a_1 + (1 - \omega) a_2 \in \mathcal{A}$.

Definition of continuous linear functionals

The functional $\psi : L^p(\nu) \to \mathbb{R}$ is linear and continuous if

(1) $\psi[\omega a_1 + a_2] = \omega\psi[a_1] + \psi[a_2]$ for any $a_1, a_2 \in L^p(\nu)$ and any real number
$\omega$; and

(2) there is a constant $C$ such that $|\psi[a]| \le C\|a\|$ for all $a \in L^p(\nu)$.

Functional Derivatives

(1) Let $\phi$ be a real functional over the normed space $L^p(\nu)$. The bounded linear
functional $\delta\phi[f; \cdot]$ is the Fréchet derivative of $\phi$ at $f \in L^p(\nu)$ if

\[ \phi[f + a] - \phi[f] = \Delta\phi[f; a] = \delta\phi[f; a] + \epsilon[f, a]\,\|a\|_{L^p(\nu)} \tag{18} \]

for all $a \in L^p(\nu)$, with $\epsilon[f, a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$.

(2) When the second variation $\delta^2\phi$ and the third variation $\delta^3\phi$ exist, they are
described by

\[
\begin{aligned}
\Delta\phi[f; a] &= \delta\phi[f; a] + \frac{1}{2}\delta^2\phi[f; a, a] + \epsilon[f, a]\,\|a\|^2_{L^p(\nu)} \\
&= \delta\phi[f; a] + \frac{1}{2}\delta^2\phi[f; a, a] + \frac{1}{6}\delta^3\phi[f; a, a, a] + \epsilon[f, a]\,\|a\|^3_{L^p(\nu)},
\end{aligned}
\tag{19}
\]

where $\epsilon[f, a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$. The term $\delta^2\phi[f; a, b]$ is bilinear with
respect to arguments $a$ and $b$, and $\delta^3\phi[f; a, b, c]$ is trilinear with respect to
$a$, $b$, and $c$.

(3) Suppose $\{a_n\}, \{f_n\} \subset L^p(\nu)$, moreover $a_n \to a$, $f_n \to f$, where $a, f \in L^p(\nu)$.
If $\phi \in C^3(L^p(\nu); \mathbb{R})$ and $\delta\phi[f; a]$, $\delta^2\phi[f; a, a]$, and $\delta^3\phi[f; a, a, a]$ are
defined as above, then $\delta\phi[f_n; a_n] \to \delta\phi[f; a]$, $\delta^2\phi[f_n; a_n, a_n] \to \delta^2\phi[f; a, a]$,
and $\delta^3\phi[f_n; a_n, a_n, a_n] \to \delta^3\phi[f; a, a, a]$, respectively.

(4) The quadratic functional $\delta^2\phi[f; a, a]$ defined on normed linear space $L^p(\nu)$
is strongly positive if there exists a constant $k > 0$ such that $\delta^2\phi[f; a, a] \ge k\|a\|^2_{L^p(\nu)}$
for all $a \in \mathcal{A}$. In a finite-dimensional space, strong positivity of
a quadratic form is equivalent to the quadratic form being positive definite.

(5) From (19),

\[
\begin{aligned}
\phi[f + a] &= \phi[f] + \delta\phi[f; a] + \frac{1}{2}\delta^2\phi[f; a, a] + o(\|a\|^2), \\
\phi[f] &= \phi[f + a] - \delta\phi[f + a; a] + \frac{1}{2}\delta^2\phi[f + a; a, a] + o(\|a\|^2),
\end{aligned}
\]

where $o(\|a\|^2)$ stands for a function that goes to zero as $\|a\|$ goes to zero,
even if it is divided by $\|a\|^2$. Adding the above two equations yields

\[ 0 = \delta\phi[f; a] - \delta\phi[f + a; a] + \frac{1}{2}\delta^2\phi[f; a, a] + \frac{1}{2}\delta^2\phi[f + a; a, a] + o(\|a\|^2), \]

which is equivalent to

\[ \delta\phi[f + a; a] - \delta\phi[f; a] = \delta^2\phi[f; a, a] + o(\|a\|^2), \tag{20} \]

because

\[ \left|\delta^2\phi[f + a; a, a] - \delta^2\phi[f; a, a]\right| \le \left\|\delta^2\phi[f + a; \cdot, \cdot] - \delta^2\phi[f; \cdot, \cdot]\right\| \|a\|^2, \]

and we assumed $\phi \in C^2$, so $\delta^2\phi[f + a; a, a] - \delta^2\phi[f; a, a]$ is of order $o(\|a\|^2)$.
This shows that the variation of the first variation of $\phi$ is the second
variation of $\phi$. A procedure like the above can be used to prove that analogous
statements hold for higher variations if they exist.

Functional Optimality Conditions

For a functional $J$ to have an extremum (minimum) at $f = \hat f$, it is necessary that

\[ \delta J[f; a] = 0 \quad \text{and} \quad \delta^2 J[f; a, a] \ge 0, \]

for $f = \hat f$ and for all admissible functions $a \in \mathcal{A}$. A sufficient condition for a
functional $J[f]$ to have a minimum for $f = \hat f$ is that the first variation $\delta J[f; a]$ must
vanish for $f = \hat f$, and its second variation $\delta^2 J[f; a, a]$ must be strongly positive for
$f = \hat f$.

Appendix B: Properties of the Functional Bregman Divergence 

The Bregman divergence for random variables has some well-known properties, as
reviewed in [10, Appendix A]. Here, we establish that the same properties hold for
the functional Bregman divergence (1).

1. Non-negativity

The functional Bregman divergence is non-negative. To show this, define $\tilde{\phi} : \mathbb{R} \to \mathbb{R}$ by $\tilde{\phi}(t) = \phi[tf + (1 - t)g]$, $f, g \in \mathcal{A}$. From the definition of the Fréchet derivative,

\[
\frac{d}{dt}\tilde{\phi} = \delta\phi[tf + (1 - t)g; f - g]. \tag{21}
\]

The function $\tilde{\phi}$ is convex because $\phi$ is convex by definition. Then from the mean value theorem there is some $0 \le t_0 \le 1$ such that

\[
\tilde{\phi}(1) - \tilde{\phi}(0) = \frac{d}{dt}\tilde{\phi}(t_0) \ge \frac{d}{dt}\tilde{\phi}(0). \tag{22}
\]

Because $\tilde{\phi}(1) = \phi[f]$, $\tilde{\phi}(0) = \phi[g]$, and (21), subtracting the right-hand side of (22) implies that

\[
\phi[f] - \phi[g] - \delta\phi[g; f - g] \ge 0. \tag{23}
\]

If $f = g$, then (23) holds in equality. To finish, we prove the converse. Suppose (23) holds in equality; then

\[
\tilde{\phi}(1) - \tilde{\phi}(0) = \frac{d}{dt}\tilde{\phi}(0). \tag{24}
\]

The equation of the straight line connecting $\tilde{\phi}(0)$ to $\tilde{\phi}(1)$ is $\ell(t) = \tilde{\phi}(0) + (\tilde{\phi}(1) - \tilde{\phi}(0))t$, and the tangent line to the curve $\tilde{\phi}$ at $\tilde{\phi}(0)$ is $y(t) = \tilde{\phi}(0) + t\frac{d}{dt}\tilde{\phi}(0)$. Because $\tilde{\phi}(\tau) = \tilde{\phi}(0) + \int_0^{\tau} \frac{d}{dt}\tilde{\phi}(t)\,dt$ and $\frac{d}{dt}\tilde{\phi}(t) \ge \frac{d}{dt}\tilde{\phi}(0)$ as a direct consequence of convexity, it must be that $\tilde{\phi}(t) \ge y(t)$. Convexity also implies that $\ell(t) \ge \tilde{\phi}(t)$. However, the assumption that (23) holds in equality implies (24), which means that $y(t) = \ell(t)$, and thus $\tilde{\phi}(t) = \ell(t)$, which is not strictly convex. Because $\phi$ is by definition strictly convex, it must be true that $\phi[tf + (1 - t)g] < t\phi[f] + (1 - t)\phi[g]$ unless $f = g$. Thus, under the assumption of equality of (23), it must be true that $f = g$.
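The non-negativity property can be checked numerically on a discretized example. The sketch below is only an illustration under stated assumptions: it takes $\phi[f] = \int f^2\,d\nu$ with Lebesgue measure on $[0,1]$, for which $\delta\phi[g;a] = 2\int g a\,d\nu$ and $d_\phi[f,g] = \int (f-g)^2\,d\nu$; the grid size, helper names, and test functions are hypothetical.

```python
import math

N = 1000  # number of midpoint-rule cells on [0, 1] (illustrative choice)
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi(f):
    """phi[f] = integral of f^2."""
    return integrate([f(x) ** 2 for x in XS])

def dphi(g, a):
    """Frechet differential of phi at g in direction a: 2 * integral of g * a."""
    return 2.0 * integrate([g(x) * a(x) for x in XS])

def bregman(f, g):
    """d_phi[f, g] = phi[f] - phi[g] - dphi[g; f - g]."""
    return phi(f) - phi(g) - dphi(g, lambda x: f(x) - g(x))

f = lambda x: math.sin(3.0 * x) + 2.0
g = lambda x: x * x + 1.0

d_fg = bregman(f, g)   # strictly positive because f != g
d_ff = bregman(f, f)   # zero because the arguments coincide
sq_err = integrate([(f(x) - g(x)) ** 2 for x in XS])
```

For this particular $\phi$, `d_fg` agrees with the squared error `sq_err` up to rounding.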

2. Convexity 

The Bregman divergence $d_\phi[f, g]$ is always convex with respect to $f$. Consider

\begin{align*}
\Delta d_\phi[f, g; a] &= d_\phi[f + a, g] - d_\phi[f, g] \\
&= \phi[f + a] - \phi[f] - \delta\phi[g; f - g + a] + \delta\phi[g; f - g].
\end{align*}

Using linearity in the third term,

\begin{align*}
\Delta d_\phi[f, g; a] &= \phi[f + a] - \phi[f] - \delta\phi[g; f - g] - \delta\phi[g; a] + \delta\phi[g; f - g] \\
&= \phi[f + a] - \phi[f] - \delta\phi[g; a] \\
&\overset{(a)}{=} \delta\phi[f; a] + \tfrac{1}{2}\delta^2\phi[f; a, a] + \epsilon[f, a]\,\|a\|^2_{L^p(\nu)} - \delta\phi[g; a] \\
\Rightarrow \quad \delta^2 d_\phi[f, g; a, a] &= \tfrac{1}{2}\delta^2\phi[f; a, a] \ge 0,
\end{align*}

where (a) and the conclusion follow from (19).

3. Linearity 

The functional Bregman divergence is linear in the sense that

\begin{align*}
d_{(c_1\phi_1 + c_2\phi_2)}[f, g] &= (c_1\phi_1 + c_2\phi_2)[f] - (c_1\phi_1 + c_2\phi_2)[g] - \delta(c_1\phi_1 + c_2\phi_2)[g; f - g] \\
&= c_1 d_{\phi_1}[f, g] + c_2 d_{\phi_2}[f, g].
\end{align*}
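A small numerical check of the linearity property is sketched below. It is an illustration under assumptions, not part of the paper: the two functionals ($\int f^2$ and $\int (f\log f - f)$), the coefficients, and the central finite-difference estimate of the differential are all hypothetical choices.

```python
import math

N = 200
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi1(f):
    """Squared-norm functional."""
    return integrate([v * v for v in f])

def phi2(f):
    """Negative-entropy-type functional; requires f > 0."""
    return integrate([v * math.log(v) - v for v in f])

def differential(phi, g, a, t=1e-5):
    """Central finite-difference estimate of delta phi[g; a]."""
    gp = [gv + t * av for gv, av in zip(g, a)]
    gm = [gv - t * av for gv, av in zip(g, a)]
    return (phi(gp) - phi(gm)) / (2.0 * t)

def bregman(phi, f, g):
    """d_phi[f, g] = phi[f] - phi[g] - delta phi[g; f - g]."""
    diff = [fv - gv for fv, gv in zip(f, g)]
    return phi(f) - phi(g) - differential(phi, g, diff)

c1, c2 = 0.7, 2.5
combo = lambda f: c1 * phi1(f) + c2 * phi2(f)

f = [1.0 + x for x in XS]
g = [2.0 - x * x for x in XS]

lhs = bregman(combo, f, g)
rhs = c1 * bregman(phi1, f, g) + c2 * bregman(phi2, f, g)
```

Here `lhs` and `rhs` agree up to floating-point rounding.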

4. Equivalence Classes 

Partition the set of strictly convex, differentiable functions $\{\phi\}$ on $\mathcal{A}$ into classes with respect to functional Bregman divergence, so that $\phi_1$ and $\phi_2$ belong to the same class if $d_{\phi_1}[f, g] = d_{\phi_2}[f, g]$ for all $f, g \in \mathcal{A}$. For brevity we will denote $d_{\phi_1}[f, g]$ simply by $d_{\phi_1}$. Let $\phi_1 \sim \phi_2$ denote that $\phi_1$ and $\phi_2$ belong to the same class; then $\sim$ is an equivalence relation because it satisfies the properties of reflexivity (because $d_{\phi_1} = d_{\phi_1}$), symmetry (because if $d_{\phi_1} = d_{\phi_2}$, then $d_{\phi_2} = d_{\phi_1}$), and transitivity (because if $d_{\phi_1} = d_{\phi_2}$ and $d_{\phi_2} = d_{\phi_3}$, then $d_{\phi_1} = d_{\phi_3}$).

Further, if $\phi_1 \sim \phi_2$, then they differ only by an affine transformation. To see this, note that, by assumption, $\phi_1[f] - \phi_1[g] - \delta\phi_1[g; f - g] = \phi_2[f] - \phi_2[g] - \delta\phi_2[g; f - g]$, and fix $g$ so $\phi_1[g]$ and $\phi_2[g]$ are constants. By the linearity property, $\delta\phi[g; f - g] = \delta\phi[g; f] - \delta\phi[g; g]$, and because $g$ is fixed, this equals $\delta\phi[g; f] + c_0$ where $c_0$ is a scalar constant. Then $\phi_2[f] = \phi_1[f] + (\delta\phi_2[g; f] - \delta\phi_1[g; f]) + c_1$, where $c_1$ is a constant. Thus,

\[
\phi_2[f] = \phi_1[f] + Af + c_1,
\]

where $A = \delta\phi_2[g; \cdot] - \delta\phi_1[g; \cdot]$, and thus $A : \mathcal{A} \to \mathbb{R}$ is a linear operator that does not depend on $f$.
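The converse direction (adding an affine term to $\phi$ does not change the divergence) is easy to confirm numerically. The following sketch assumes $\phi_1[f] = \int f^2\,d\nu$ on $[0,1]$ and a hypothetical affine perturbation $Af + c_1$ with $Af = \int w f\,d\nu$; the weight `W`, the constant `C1`, and the test functions are illustrative assumptions.

```python
N = 500
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi1(f):
    return integrate([v * v for v in f])

def dphi1(g, a):
    return 2.0 * integrate([gv * av for gv, av in zip(g, a)])

W = [3.0 * x - 1.0 for x in XS]  # weight defining the linear part A
C1 = 4.2                          # additive constant

def phi2(f):
    """phi2[f] = phi1[f] + A f + c1, with A f = integral of W * f."""
    return phi1(f) + integrate([wv * fv for wv, fv in zip(W, f)]) + C1

def dphi2(g, a):
    # The constant contributes nothing; the linear part contributes A a.
    return dphi1(g, a) + integrate([wv * av for wv, av in zip(W, a)])

def bregman(phi, dphi, f, g):
    """d_phi[f, g] = phi[f] - phi[g] - delta phi[g; f - g]."""
    diff = [fv - gv for fv, gv in zip(f, g)]
    return phi(f) - phi(g) - dphi(g, diff)

f = [x ** 3 for x in XS]
g = [1.0 - x for x in XS]
d1 = bregman(phi1, dphi1, f, g)
d2 = bregman(phi2, dphi2, f, g)
```

Both `d1` and `d2` evaluate to the same divergence, so `phi1` and `phi2` fall in the same equivalence class.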

5. Linear Separation 

Fix two non-equal functions $g_1, g_2 \in \mathcal{A}$, and consider the set of all functions in $\mathcal{A}$ that are equidistant in terms of functional Bregman divergence from $g_1$ and $g_2$:

\begin{align*}
& d_\phi[f, g_1] = d_\phi[f, g_2] \\
\Rightarrow \quad & -\phi[g_1] - \delta\phi[g_1; f - g_1] = -\phi[g_2] - \delta\phi[g_2; f - g_2] \\
\Rightarrow \quad & -\delta\phi[g_1; f - g_1] = \phi[g_1] - \phi[g_2] - \delta\phi[g_2; f - g_2].
\end{align*}

Using linearity, the above relationship can be equivalently expressed as

\begin{align*}
-\delta\phi[g_1; f] + \delta\phi[g_1; g_1] &= \phi[g_1] - \phi[g_2] - \delta\phi[g_2; f] + \delta\phi[g_2; g_2], \\
\delta\phi[g_2; f] - \delta\phi[g_1; f] &= \phi[g_1] - \phi[g_2] - \delta\phi[g_1; g_1] + \delta\phi[g_2; g_2], \\
Lf &= c,
\end{align*}

where $L$ is the bounded linear functional defined by $Lf = \delta\phi[g_2; f] - \delta\phi[g_1; f]$, and $c$ is the constant corresponding to the right-hand side. In other words, $f$ has to be in the set $\{a \in \mathcal{A} : La = c\}$, where $c$ is a constant. This set is a hyperplane.
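To make the hyperplane concrete, the sketch below works through the special case $\phi[f] = \int f^2\,d\nu$ on $[0,1]$, where $Lf = 2\int (g_2 - g_1) f\,d\nu$. This is only an illustration: the choice of $\phi$, the functions `g1`, `g2`, `h`, and the construction of a point on the hyperplane are assumptions, not taken from the paper.

```python
N = 500
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def inner(u, v):
    return integrate([a * b for a, b in zip(u, v)])

def bregman_sq(f, g):
    """d_phi[f, g] for phi[f] = integral of f^2, i.e. the squared L2 distance."""
    return integrate([(a - b) ** 2 for a, b in zip(f, g)])

g1 = [x for x in XS]
g2 = [1.0 + x * x for x in XS]
n_vec = [b - a for a, b in zip(g1, g2)]  # g2 - g1

def L(f):
    """L f = delta phi[g2; f] - delta phi[g1; f] = 2 * integral of (g2 - g1) * f."""
    return 2.0 * inner(n_vec, f)

# c = phi[g1] - phi[g2] - delta phi[g1; g1] + delta phi[g2; g2]
c = inner(g1, g1) - inner(g2, g2) - 2.0 * inner(g1, g1) + 2.0 * inner(g2, g2)

# Build a point on the hyperplane: start from the midpoint of g1 and g2 and
# add a perturbation made orthogonal to (g2 - g1) by one Gram-Schmidt step.
h = [x * (1.0 - x) for x in XS]
coef = inner(h, n_vec) / inner(n_vec, n_vec)
h_perp = [hv - coef * nv for hv, nv in zip(h, n_vec)]
f = [0.5 * (a + b) + 5.0 * hp for a, b, hp in zip(g1, g2, h_perp)]

gap = bregman_sq(f, g1) - bregman_sq(f, g2)  # equidistance check
residual = L(f) - c                           # hyperplane membership check
```

Both `gap` and `residual` vanish up to rounding, so the constructed `f` is equidistant from `g1` and `g2` and satisfies `L(f) = c`.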
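
6. Dual Divergence

Given a pair $(g, \phi)$ where $g \in L^p(\nu)$ and $\phi$ is a strictly convex twice-continuously Fréchet-differentiable functional, then the function-functional pair $(G, \psi)$ is the Legendre transform of $(g, \phi)$ [24], if

\begin{align}
\phi[g] &= -\psi[G] + \int g(x)G(x)\,d\nu(x), \tag{25} \\
\delta\phi[g; a] &= \int G(x)a(x)\,d\nu(x), \tag{26}
\end{align}

where $\psi$ is a strictly convex twice-continuously Fréchet-differentiable functional, and $G \in L^q(\nu)$, where $\frac{1}{p} + \frac{1}{q} = 1$.

Given Legendre transformation pairs $f, g \in L^p(\nu)$ and $F, G \in L^q(\nu)$,

\[
d_\phi[f, g] = d_\psi[G, F].
\]

The proof begins by substituting (25) and (26) into (1):

\begin{align}
d_\phi[f, g] &= \phi[f] + \psi[G] - \int g(x)G(x)\,d\nu(x) - \int G(x)(f - g)(x)\,d\nu(x) \notag \\
&= \phi[f] + \psi[G] - \int G(x)f(x)\,d\nu(x). \tag{27}
\end{align}

Applying the Legendre transformation to $(G, \psi)$ implies that

\begin{align}
\psi[G] &= -\phi[g] + \int g(x)G(x)\,d\nu(x), \tag{28} \\
\delta\psi[G; a] &= \int g(x)a(x)\,d\nu(x). \tag{29}
\end{align}

Using (28) and (29), $d_\psi[G, F]$ can be reduced to (27).
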
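The dual-divergence identity $d_\phi[f,g] = d_\psi[G,F]$ can be illustrated numerically with one well-known conjugate pair. The sketch assumes $\phi[f] = \int (f\log f - f)\,d\nu$ on $[0,1]$, whose differential is $\delta\phi[g;a] = \int \log(g)\,a\,d\nu$, so that the dual function is $G = \log g$ and the conjugate functional is $\psi[G] = \int e^{G}\,d\nu$. This specific pair, the grid, and the test functions are assumptions chosen for illustration.

```python
import math

N = 300
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi(f):
    """phi[f] = integral of (f log f - f); requires f > 0."""
    return integrate([v * math.log(v) - v for v in f])

def dphi(g, a):
    """delta phi[g; a] = integral of log(g) * a, so the dual function is G = log g."""
    return integrate([math.log(gv) * av for gv, av in zip(g, a)])

def psi(G):
    """Conjugate functional psi[G] = integral of exp(G)."""
    return integrate([math.exp(v) for v in G])

def dpsi(G, a):
    """delta psi[G; a] = integral of exp(G) * a."""
    return integrate([math.exp(Gv) * av for Gv, av in zip(G, a)])

def bregman(func, dfunc, u, v):
    """Generic divergence func[u] - func[v] - dfunc[v; u - v]."""
    return func(u) - func(v) - dfunc(v, [a - b for a, b in zip(u, v)])

f = [1.0 + 0.5 * math.sin(6.0 * x) for x in XS]
g = [2.0 - x for x in XS]
F = [math.log(v) for v in f]  # dual of f
G = [math.log(v) for v in g]  # dual of g

primal = bregman(phi, dphi, f, g)  # d_phi[f, g]
dual = bregman(psi, dpsi, G, F)    # d_psi[G, F], arguments swapped
```

The two values coincide up to rounding; note that the order of the arguments is reversed in the dual divergence.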
7. Generalized Pythagorean Inequality 

For any $f, g, h \in \mathcal{A}$,

\[
d_\phi[f, h] = d_\phi[f, g] + d_\phi[g, h] + \delta\phi[g; f - g] - \delta\phi[h; f - g].
\]

This can be derived as follows:

\begin{align*}
d_\phi[f, g] + d_\phi[g, h]
&= \phi[f] - \phi[h] - \delta\phi[g; f - g] - \delta\phi[h; g - h] \\
&= \phi[f] - \phi[h] - \delta\phi[h; f - h] + \delta\phi[h; f - h] - \delta\phi[g; f - g] - \delta\phi[h; g - h] \\
&= d_\phi[f, h] + \delta\phi[h; f - g] - \delta\phi[g; f - g],
\end{align*}

where the last line follows from the definition of the functional Bregman divergence
and the linearity of the fourth and last terms.
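The three-point identity can be verified numerically for a non-quadratic functional. The sketch below assumes $\phi[f] = \int (f\log f - f)\,d\nu$ on $[0,1]$ with $\delta\phi[g;a] = \int \log(g)\,a\,d\nu$; the functional, grid, and test functions are illustrative assumptions.

```python
import math

N = 300
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi(f):
    """phi[f] = integral of (f log f - f); requires f > 0."""
    return integrate([v * math.log(v) - v for v in f])

def dphi(g, a):
    """delta phi[g; a] = integral of log(g) * a."""
    return integrate([math.log(gv) * av for gv, av in zip(g, a)])

def bregman(f, g):
    """d_phi[f, g] = phi[f] - phi[g] - delta phi[g; f - g]."""
    return phi(f) - phi(g) - dphi(g, [a - b for a, b in zip(f, g)])

f = [1.0 + x for x in XS]
g = [2.0 - x * x for x in XS]
h = [1.5 + 0.5 * math.cos(5.0 * x) for x in XS]

f_minus_g = [a - b for a, b in zip(f, g)]
lhs = bregman(f, h)
rhs = bregman(f, g) + bregman(g, h) + dphi(g, f_minus_g) - dphi(h, f_minus_g)
```

Here `lhs` and `rhs` agree up to rounding, for any positive `f`, `g`, `h`.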

Appendix C: Proofs 

5.1. Proof of Proposition 1.2. We give a constructive proof that there is a corresponding functional Bregman divergence $d_\phi[f, g]$ for a specific choice of $\phi : \mathcal{A}^1 \to \mathbb{R}$, where $\mathcal{A}^1$ is the set of functions $\mathcal{A}$ with $p = 1$, and where $\nu = \sum_{i=1}^{n} \delta_{c_i}$ and $f, g \in \mathcal{A}^1$. Here, $\delta_x$ is the Dirac measure (such that all mass is concentrated at $x$) and $\{c_1, c_2, \ldots, c_n\}$ is a collection of $n$ distinct points in $\mathbb{R}^d$.

For any $x \in \mathbb{R}^n$, define $\phi[f] = \tilde{\phi}(x_1, x_2, \ldots, x_n)$, where $f(c_1) = x_1, f(c_2) = x_2, \ldots, f(c_n) = x_n$. Then the difference is

\begin{align*}
\Delta\phi[f; a] &= \phi[f + a] - \phi[f] \\
&= \tilde{\phi}\big((f + a)(c_1), \ldots, (f + a)(c_n)\big) - \tilde{\phi}(x_1, \ldots, x_n) \\
&= \tilde{\phi}\big(x_1 + a(c_1), \ldots, x_n + a(c_n)\big) - \tilde{\phi}(x_1, \ldots, x_n).
\end{align*}

Let $a_i$ be shorthand for $a(c_i)$, and use the Taylor expansion for functions of several variables to yield

\[
\Delta\phi[f; a] = \nabla\tilde{\phi}(x_1, \ldots, x_n)^T (a_1, \ldots, a_n) + \epsilon[f, a]\,\|a\|_{L^1}.
\]

Therefore,

\[
\delta\phi[f; a] = \nabla\tilde{\phi}(x_1, \ldots, x_n)^T (a_1, \ldots, a_n) = \nabla\tilde{\phi}(x)^T a,
\]

where $x = (x_1, x_2, \ldots, x_n)$ and $a = (a_1, \ldots, a_n)$. Thus, from (3), the functional Bregman divergence definition (1) for $\phi$ is equivalent to the standard vector Bregman divergence:

\begin{align}
d_\phi[f, g] &= \phi[f] - \phi[g] - \delta\phi[g; f - g] \notag \\
&= \tilde{\phi}(x) - \tilde{\phi}(y) - \nabla\tilde{\phi}(y)^T (x - y). \tag{30}
\end{align}
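The right-hand side of (30) is straightforward to evaluate. The sketch below is a minimal illustration with hypothetical helper names: it computes the vector Bregman divergence for two standard generators, the squared Euclidean norm (giving squared Euclidean distance) and $\sum_i (x_i \log x_i - x_i)$ (giving the generalized I-divergence). The vectors `x` and `y` stand in for the function values at the points $c_1, \ldots, c_n$.

```python
import math

def vector_bregman(phi, grad, x, y):
    """Standard vector Bregman divergence phi(x) - phi(y) - grad(y)^T (x - y)."""
    g = grad(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

# Squared Euclidean norm -> squared Euclidean distance.
sq_norm = lambda x: sum(v * v for v in x)
sq_grad = lambda x: [2.0 * v for v in x]

# Negative entropy (with the -x term) -> generalized I-divergence.
neg_ent = lambda x: sum(v * math.log(v) - v for v in x)
neg_ent_grad = lambda x: [math.log(v) for v in x]

x = [0.2, 1.5, 3.0]  # plays the role of (f(c_1), ..., f(c_n))
y = [1.0, 0.5, 2.0]  # plays the role of (g(c_1), ..., g(c_n))

d_sq = vector_bregman(sq_norm, sq_grad, x, y)
d_kl = vector_bregman(neg_ent, neg_ent_grad, x, y)

expected_sq = sum((a - b) ** 2 for a, b in zip(x, y))
expected_kl = sum(a * math.log(a / b) - a + b for a, b in zip(x, y))
```

In both cases the generic formula reproduces the familiar closed form.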

5.2. Proof of Proposition 1.3. First, we give a constructive proof of the first part of the proposition by showing that given a $B_{s,\nu}$ there is an equivalent functional divergence $d_\phi$. Then, the second part of the proposition is shown by example: we prove that the squared bias functional Bregman divergence given in Section 2.1.2 is a functional Bregman divergence that cannot be defined as a pointwise Bregman divergence.

Note that the integral to calculate $B_{s,\nu}$ is not always finite. To ensure finite $B_{s,\nu}$, we explicitly constrain $\lim_{x \to 0} s'(x)$ and $\lim_{x \to 0} s(x)$ to be finite. From the assumption that $s$ is strictly convex, $s$ must be continuous on $(0, \infty)$. Recall from the assumptions that the measure $\nu$ is finite, and that the function $s$ is differentiable on $(0, \infty)$.
Given a $B_{s,\nu}$, define the continuously differentiable function

\[
\tilde{s}(x) =
\begin{cases}
s(x) & x \ge 0, \\
-s(-x) + 2s(0) & x < 0.
\end{cases}
\]

Specify $\phi : L^\infty(\nu) \to \mathbb{R}$ as

\[
\phi[f] = \int_X \tilde{s}(f(x))\,d\nu.
\]

Note that if $f \ge 0$,

\[
\phi[f] = \int_X s(f(x))\,d\nu.
\]

Because $\tilde{s}$ is continuous on $\mathbb{R}$, $\tilde{s}(f) \in L^\infty$ whenever $f \in L^\infty$, so the integrals always make sense.

It remains to be shown that $\delta\phi[f; \cdot]$ completes the equivalence when $f \ge 0$. For $h \in L^\infty$,

\begin{align*}
\phi[f + h] - \phi[f]
&= \int_X \tilde{s}(f(x) + h(x))\,d\nu - \int_X \tilde{s}(f(x))\,d\nu \\
&= \int_X \tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))\,d\nu \\
&= \int_X \tilde{s}'(f(x))h(x) + \epsilon(f(x), h(x))h(x)\,d\nu \\
&= \int_X s'(f(x))h(x) + \epsilon(f(x), h(x))h(x)\,d\nu,
\end{align*}

where we used the fact that

\begin{align*}
\tilde{s}(f(x) + h(x))
&= \tilde{s}(f(x)) + \big(\tilde{s}'(f(x)) + \epsilon(f(x), h(x))\big)h(x) \\
&= s(f(x)) + \big(s'(f(x)) + \epsilon(f(x), h(x))\big)h(x),
\end{align*}

because $f \ge 0$. On the other hand, if $h(x) = 0$ then $\epsilon(f(x), h(x)) = 0$, and if $h(x) \ne 0$ then

\[
|\epsilon(f(x), h(x))| \le \left| \frac{\tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))}{h(x)} \right| + |s'(f(x))|.
\]

Suppose $\{h_n\} \subset L^\infty(\nu)$ such that $h_n \to 0$. Then there is a measurable set $E$ such that its complement is of measure $0$ and $h_n \to 0$ uniformly on $E$. There is some $N > 0$ such that for any $n > N$, $|h_n(x)| \le \epsilon$ for all $x \in E$. Without loss of generality, assume that there is some $M > 0$ such that for all $x \in E$, $|f(x)| \le M$. Since $\tilde{s}$ is continuously differentiable, there is a $K > 0$ such that $\max\{|\tilde{s}'(t)| \text{ subject to } t \in [-M - \epsilon, M + \epsilon]\} \le K$, and by the mean value theorem

\[
\left| \frac{\tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))}{h(x)} \right| \le K,
\]

for almost all $x \in X$. Then

\[
|\epsilon(f(x), h(x))| \le 2K,
\]

except on a set of measure $0$. The fact that $h(x) \to 0$ almost everywhere implies that $|\epsilon(f(x), h(x))| \to 0$ almost everywhere, and by Lebesgue's dominated convergence theorem, the corresponding integral goes to $0$. As a result, the Fréchet derivative of $\phi$ is

\[
\delta\phi[f; h] = \int_X s'(f(x))h(x)\,d\nu. \tag{31}
\]

Thus the functional Bregman divergence is equivalent to the given pointwise $B_{s,\nu}$.

We additionally note that the assumptions that $f \in L^\infty$ and that the measure $\nu$ is finite are necessary for this proof. Counterexamples can be constructed if $f \in L^p$ or $\nu(X) = \infty$ such that the Fréchet derivative of $\phi$ does not obey (31). This concludes the first part of the proof.

To show that the squared bias functional Bregman divergence given in Section 2.1.2 is an example of a functional Bregman divergence that cannot be defined as a pointwise Bregman divergence, we prove that the converse statement leads to a contradiction.

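Suppose $(X, \Sigma, \nu)$ and $(X, \Sigma, \mu)$ are measure spaces where $\nu$ is a non-zero $\sigma$-finite measure and that there is a differentiable function $f : (0, \infty) \to \mathbb{R}$ such that

\[
\left( \int \xi\,d\nu \right)^2 = \int f(\xi)\,d\mu, \tag{32}
\]

where $\xi \in \mathcal{A}^1$, the set of functions $\mathcal{A}$ with $p = 1$. Let $f(0) = \lim_{x \to 0} f(x)$, which can be finite or infinite, and let $\alpha$ be any real number. Then

\[
\int f(\alpha\xi)\,d\mu = \left( \int \alpha\xi\,d\nu \right)^2 = \alpha^2 \left( \int \xi\,d\nu \right)^2 = \alpha^2 \int f(\xi)\,d\mu.
\]

Because $\nu$ is $\sigma$-finite, there is a measurable set $E$ such that $0 < |\nu(E)| < \infty$. Let $X \setminus E$ denote the complement of $E$ in $X$. Then

\begin{align*}
\alpha^2 \nu^2(E) &= \alpha^2 \left( \int I_E\,d\nu \right)^2 \\
&= \alpha^2 \int f(I_E)\,d\mu \\
&= \alpha^2 \int_{X \setminus E} f(0)\,d\mu + \alpha^2 \int_E f(1)\,d\mu \\
&= \alpha^2 f(0)\mu(X \setminus E) + \alpha^2 f(1)\mu(E).
\end{align*}

Also,

\[
\alpha^2 \nu^2(E) = \left( \int \alpha I_E\,d\nu \right)^2.
\]

However,

\begin{align*}
\left( \int \alpha I_E\,d\nu \right)^2 &= \int f(\alpha I_E)\,d\mu \\
&= \int_{X \setminus E} f(\alpha I_E)\,d\mu + \int_E f(\alpha I_E)\,d\mu \\
&= f(0)\mu(X \setminus E) + f(\alpha)\mu(E);
\end{align*}

so one can conclude that

\[
\alpha^2 f(0)\mu(X \setminus E) + \alpha^2 f(1)\mu(E) = f(0)\mu(X \setminus E) + f(\alpha)\mu(E). \tag{33}
\]

Apply equation (32) for $\xi = 0$ to yield

\[
0 = \left( \int 0\,d\nu \right)^2 = \int f(0)\,d\mu = f(0)\mu(X).
\]

Since $|\nu(E)| > 0$, $\mu(X) \ne 0$, so it must be that $f(0) = 0$, and (33) becomes

\[
\alpha^2 \nu^2(E) = \alpha^2 f(1)\mu(E) = f(\alpha)\mu(E) \quad \forall \alpha \in \mathbb{R}.
\]

The first equation implies that $\mu(E) \ne 0$. The second equation determines the function $f$ completely:

\[
f(\alpha) = f(1)\alpha^2.
\]

Then (32) becomes

\[
\left( \int \xi\,d\nu \right)^2 = \int f(1)\xi^2\,d\mu.
\]

Consider any two disjoint measurable sets, $E_1$ and $E_2$, with finite nonzero measure. Define $\xi_1 = I_{E_1}$ and $\xi_2 = I_{E_2}$. Then $\xi = \xi_1 + \xi_2$ and $\xi_1\xi_2 = I_{E_1}I_{E_2} = 0$. Equation (32) becomes

\[
\int \xi_1\,d\nu \int \xi_2\,d\nu = f(1) \int \xi_1\xi_2\,d\mu. \tag{34}
\]

This implies the following contradiction:

\[
\int \xi_1\,d\nu \int \xi_2\,d\nu = \nu(E_1)\nu(E_2) \ne 0, \tag{35}
\]

but

\[
f(1) \int \xi_1\xi_2\,d\mu = 0. \tag{36}
\]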
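The contradiction in (35) and (36) comes down to an additivity mismatch, which the following sketch illustrates numerically. It is only an illustration under assumptions: Lebesgue measure on $[0,1]$ stands in for both $\nu$ and $\mu$, the sets $E_1 = [0, 0.3)$ and $E_2 = [0.5, 0.9)$ are arbitrary, and $f(t) = t^2$ is used as a representative pointwise function with $f(0) = 0$.

```python
N = 1000
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def indicator(lo, hi):
    """Indicator function of [lo, hi) sampled on the grid."""
    return [1.0 if lo <= x < hi else 0.0 for x in XS]

def squared_bias(xi):
    """(integral of xi)^2, the left-hand side of (32)."""
    return integrate(xi) ** 2

def pointwise(xi, f=lambda t: t * t):
    """Integral of f(xi) for a pointwise function f with f(0) = 0."""
    return integrate([f(v) for v in xi])

xi1 = indicator(0.0, 0.3)
xi2 = indicator(0.5, 0.9)
xi = [a + b for a, b in zip(xi1, xi2)]

# A pointwise functional with f(0) = 0 is additive over disjoint supports ...
pointwise_gap = pointwise(xi) - pointwise(xi1) - pointwise(xi2)
# ... but the squared bias picks up the cross term 2 * nu(E1) * nu(E2).
bias_gap = squared_bias(xi) - squared_bias(xi1) - squared_bias(xi2)
```

The pointwise gap is zero, while the squared-bias gap equals $2\nu(E_1)\nu(E_2) = 0.24$ for these sets, so no pointwise integral can reproduce the squared bias.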

5.3. Proof of Theorem ILL Let 

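\begin{align}
J[g] &= E_{P_F}[d_\phi(F, g)] = \int_M d_\phi[f, g]P(f)\,dM \notag \\
&= \int_M \big(\phi[f] - \phi[g] - \delta\phi[g; f - g]\big)P(f)\,dM, \tag{37}
\end{align}

where (37) follows by substituting the definition of Bregman divergence (1). Consider the increment

\begin{align}
\Delta J[g; a] &= J[g + a] - J[g] \tag{38} \\
&= -\int_M \big(\phi[g + a] - \phi[g]\big)P(f)\,dM \notag \\
&\quad - \int_M \big(\delta\phi[g + a; f - g - a] - \delta\phi[g; f - g]\big)P(f)\,dM, \tag{39}
\end{align}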
where (39) follows from substituting (37) into (38). Using the definition of the differential of a functional (see Appendix A, (18)), the first integrand in (39) can be written as

\[
\phi[g + a] - \phi[g] = \delta\phi[g; a] + \epsilon[g, a]\,\|a\|_{L^1(\nu)}. \tag{40}
\]

Take the second integrand of (39), and subtract and add $\delta\phi[g; f - g - a]$,

\begin{align}
& \delta\phi[g + a; f - g - a] - \delta\phi[g; f - g] \notag \\
&= \delta\phi[g + a; f - g - a] - \delta\phi[g; f - g - a] + \delta\phi[g; f - g - a] - \delta\phi[g; f - g] \notag \\
&\overset{(a)}{=} \delta^2\phi[g; f - g - a, a] + \epsilon[g, a]\,\|a\|_{L^1(\nu)} + \delta\phi[g; f - g] - \delta\phi[g; a] - \delta\phi[g; f - g] \notag \\
&\overset{(b)}{=} \delta^2\phi[g; f - g, a] - \delta^2\phi[g; a, a] + \epsilon[g, a]\,\|a\|_{L^1(\nu)} - \delta\phi[g; a], \tag{41}
\end{align}

where (a) follows from (20) and the linearity of the third term, and (b) follows from the linearity of the first term. Substitute (40) and (41) into (39),

\[
\Delta J[g; a] = -\int_M \big(\delta^2\phi[g; f - g, a] - \delta^2\phi[g; a, a] + \epsilon[g, a]\,\|a\|_{L^1(\nu)}\big)P(f)\,dM.
\]

Note that the term $\delta^2\phi[g; a, a]$ is of order $\|a\|^2_{L^1(\nu)}$, that is, $\|\delta^2\phi[g; a, a]\| \le m\,\|a\|^2_{L^1(\nu)}$ for some constant $m$. Therefore,

\[
\lim_{\|a\|_{L^1(\nu)} \to 0} \frac{\|J[g + a] - J[g] - \delta J[g; a]\|}{\|a\|_{L^1(\nu)}} = 0,
\]

where

\[
\delta J[g; a] = -\int_M \delta^2\phi[g; f - g, a]P(f)\,dM. \tag{42}
\]

For fixed $a$, $\delta^2\phi[g; \cdot, a]$ is a bounded linear functional in the second argument, so the integration and the functional can be interchanged in (42), which becomes

\[
\delta J[g; a] = -\delta^2\phi\left[g; \int_M (f - g)P(f)\,dM, a\right].
\]

Using the functional optimality conditions (stated in Appendix A), $J[g]$ has an extremum for $g = \hat{g}$ if

\[
\delta^2\phi\left[\hat{g}; \int_M (f - \hat{g})P(f)\,dM, a\right] = 0. \tag{43}
\]

Set $a = \int_M (f - \hat{g})P(f)\,dM$ in (43) and use the assumption that the quadratic functional $\delta^2\phi[g; a, a]$ is strongly positive, which implies that the above functional can be zero if and only if $a = 0$, that is,

\begin{align}
0 &= \int_M (f - \hat{g})P(f)\,dM, \tag{44} \\
\hat{g} &= E_{P_F}[F], \tag{45}
\end{align}


where the last line holds if the expectation exists (i.e. if the measure is well-defined and the expectation is finite). Because a Bregman divergence is not necessarily convex in its second argument, it is not yet established that the above unique extremum is a minimum. To see that (45) is in fact a minimum of $J[g]$, from the functional optimality conditions it is enough to show that $\delta^2 J[\hat{g}; a, a]$ is strongly positive. To show this, for $b \in \mathcal{A}$, consider

\begin{align}
& \delta J[g + b; a] - \delta J[g; a] \notag \\
&\overset{(c)}{=} -\int_M \big(\delta^2\phi[g + b; f - g - b, a] - \delta^2\phi[g; f - g, a]\big)P(f)\,dM \notag \\
&\overset{(d)}{=} -\int_M \big(\delta^2\phi[g + b; f - g - b, a] - \delta^2\phi[g; f - g - b, a] \notag \\
&\qquad\qquad + \delta^2\phi[g; f - g - b, a] - \delta^2\phi[g; f - g, a]\big)P(f)\,dM \notag \\
&\overset{(e)}{=} -\int_M \big(\delta^3\phi[g; f - g - b, a, b] + \epsilon[g, a, b]\,\|b\|_{L^1(\nu)} \notag \\
&\qquad\qquad + \delta^2\phi[g; f - g, a] - \delta^2\phi[g; b, a] - \delta^2\phi[g; f - g, a]\big)P(f)\,dM \notag \\
&\overset{(f)}{=} -\int_M \big(\delta^3\phi[g; f - g, a, b] - \delta^3\phi[g; b, a, b] \notag \\
&\qquad\qquad + \epsilon[g, a, b]\,\|b\|_{L^1(\nu)} - \delta^2\phi[g; b, a]\big)P(f)\,dM, \tag{46}
\end{align}

where (c) follows from using integral (42); (d) from subtracting and adding $\delta^2\phi[g; f - g - b, a]$; (e) from the fact that the variation of the second variation of $\phi$ is the third variation of $\phi$ [25]; and (f) from the linearity of the first term and cancellation of the third and fifth terms. Note that in (46) for fixed $a$, the term $\delta^3\phi[g; b, a, b]$ is of order $\|b\|^2_{L^1(\nu)}$, while the first and the last terms are of order $\|b\|_{L^1(\nu)}$. Therefore,

\[
\lim_{\|b\|_{L^1(\nu)} \to 0} \frac{\|\delta J[g + b; a] - \delta J[g; a] - \delta^2 J[g; a, b]\|}{\|b\|_{L^1(\nu)}} = 0,
\]

where

\[
\delta^2 J[g; a, b] = -\int_M \delta^3\phi[g; f - g, a, b]P(f)\,dM + \int_M \delta^2\phi[g; a, b]P(f)\,dM. \tag{47}
\]

Substitute $b = a$, $g = \hat{g}$ and interchange integration and the continuous functional in the first integral of (47); then

\begin{align}
\delta^2 J[\hat{g}; a, a] &= -\delta^3\phi\left[\hat{g}; \int_M (f - \hat{g})P(f)\,dM, a, a\right] + \int_M \delta^2\phi[\hat{g}; a, a]P(f)\,dM \notag \\
&= \int_M \delta^2\phi[\hat{g}; a, a]P(f)\,dM \tag{48} \\
&\ge \int_M k\,\|a\|^2_{L^1(\nu)}P(f)\,dM = k\,\|a\|^2_{L^1(\nu)} > 0, \tag{49}
\end{align}

where (48) follows from (44), and (49) follows from the strong positivity of $\delta^2\phi[\hat{g}; a, a]$. Therefore, from (49) and the functional optimality conditions, $\hat{g}$ is the minimum.
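The conclusion that the mean minimizes the expected divergence can be illustrated on a toy example. The sketch below is not the setting of the theorem: it assumes a finite distribution over three discretized positive functions on $[0,1]$ and the functional $\phi[f] = \int (f\log f - f)\,d\nu$, all of which are hypothetical choices. It also checks the decomposition $J[g] - J[\hat{g}] = d_\phi[\hat{g}, g]$, which follows from the linearity of $\delta\phi[g; \cdot]$ when $\hat{g}$ is the mean.

```python
import math

N = 200
XS = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return sum(values) / N

def phi(f):
    """phi[f] = integral of (f log f - f); requires f > 0."""
    return integrate([v * math.log(v) - v for v in f])

def dphi(g, a):
    """delta phi[g; a] = integral of log(g) * a."""
    return integrate([math.log(gv) * av for gv, av in zip(g, a)])

def bregman(f, g):
    return phi(f) - phi(g) - dphi(g, [a - b for a, b in zip(f, g)])

# A finite "distribution over functions": three positive functions and weights.
functions = [
    [1.0 + x for x in XS],
    [2.0 - x * x for x in XS],
    [1.5 + 0.5 * math.cos(4.0 * x) for x in XS],
]
weights = [0.5, 0.3, 0.2]

def expected_divergence(g):
    """J[g] = E[d_phi(F, g)] for the finite distribution above."""
    return sum(w * bregman(f, g) for w, f in zip(weights, functions))

mean = [sum(w * f[i] for w, f in zip(weights, functions)) for i in range(N)]
j_mean = expected_divergence(mean)

# Compare against a few other candidates that stay positive.
candidates = [
    [m + 0.2 * math.sin(9.0 * x) for m, x in zip(mean, XS)],
    [1.1 * m for m in mean],
    functions[0],
]
excess = [expected_divergence(g) - j_mean for g in candidates]
identity_gap = [e - bregman(mean, g) for e, g in zip(excess, candidates)]
```

Every entry of `excess` is positive, and each equals the divergence from the mean to the candidate up to rounding.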

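5.4. Derivation of the Bayesian Distribution-based Uniform Estimate Restricted to a Uniform Minimizer. Let $f(x) = 1/a$ for all $0 \le x \le a$ and $g(x) = 1/b$ for all $0 \le x \le b$. Assume at first that $b > a$; then the total squared difference between $f$ and $g$ is

\begin{align*}
\int_x (f(x) - g(x))^2\,dx &= \left( \frac{1}{a} - \frac{1}{b} \right)^2 a + \left( \frac{1}{b} \right)^2 (b - a) \\
&= \frac{b - a}{ab} \\
&= \frac{|b - a|}{ab},
\end{align*}

where the last line does not require the assumption that $b > a$.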
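The closed form $|b - a|/(ab)$ can be confirmed by direct numerical integration. The sketch below is an illustrative check with hypothetical function names and arbitrarily chosen parameter pairs; it uses a midpoint rule on $[0, \max(a, b)]$.

```python
def squared_difference(a, b, cells=100000):
    """Midpoint-rule integral of (f - g)^2, f = 1/a on [0, a], g = 1/b on [0, b]."""
    upper = max(a, b)
    h = upper / cells
    total = 0.0
    for i in range(cells):
        x = (i + 0.5) * h
        fx = 1.0 / a if x <= a else 0.0
        gx = 1.0 / b if x <= b else 0.0
        total += (fx - gx) ** 2
    return total * h

def closed_form(a, b):
    """|b - a| / (a b), valid for either ordering of a and b."""
    return abs(b - a) / (a * b)

pairs = [(1.0, 2.0), (3.0, 1.5), (0.5, 0.8)]
errors = [abs(squared_difference(a, b) - closed_form(a, b)) for a, b in pairs]
```

The pairs include both $b > a$ and $b < a$, which exercises the absolute value in the final expression.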

In this case, the integral (15) is over the one-dimensional manifold of uniform distributions $\mathcal{U}$; a Riemannian metric can be formed by using the differential arc element to convert Lebesgue measure on the set $\mathcal{U}$ to a measure on the set of parameters $a$ such that (13) is re-formulated in terms of the parameters for ease of calculation:

\[
b^* = \arg\min_{b \in \mathbb{R}^+} \int_{a = X_{\max}}^{\infty} \frac{|b - a|}{ab}\,\frac{1}{a^n}\,\left\| \frac{df}{da} \right\|_2 da, \tag{50}
\]

where $\frac{1}{a^n}$ is the likelihood of the $n$ data points being drawn from a uniform distribution $[0, a]$, and the estimated distribution is uniform on $[0, b^*]$. The differential arc element $\left\| \frac{df}{da} \right\|_2$ can be calculated by expanding $df/da$ in terms of the Haar orthonormal basis $\{\frac{1}{\sqrt{a}}, \phi_{jk}(x)\}$, which forms a complete orthonormal basis for the interval $0 \le x \le a$, and then the required norm is equivalent to the norm of the basis coefficients of the orthonormal expansion:

\[
\left\| \frac{df}{da} \right\|_2 = \frac{1}{a^{3/2}}. \tag{51}
\]



For estimation problems, the measure determined by the Fisher information metric may be more appropriate than Lebesgue measure [21-23]. Then

\[
dM = |I(a)|^{1/2}\,da, \tag{52}
\]

where $I$ is the Fisher information matrix. For the one-dimensional manifold $M$ formed by the set of scaled uniform distributions $\mathcal{U}$, the Fisher information matrix is

\[
I(a) = E_X\left[ \left( \frac{d \log \frac{1}{a}}{da} \right)^2 \right] = \int_0^a \frac{1}{a^2}\,\frac{1}{a}\,dx = \frac{1}{a^2},
\]

so that the differential element is $dM = \frac{da}{a}$.

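We solve (13) using the Lebesgue measure (51); the solution with the Fisher differential element follows the same logic. Then (50) is equivalent to

\begin{align*}
\arg\min_b J(b) &= \int_{a = X_{\max}}^{\infty} \frac{|b - a|}{ab}\,\frac{1}{a^{n + 3/2}}\,da \\
&= \int_{X_{\max}}^{b} \frac{b - a}{ab}\,\frac{da}{a^{n + 3/2}} + \int_{b}^{\infty} \frac{a - b}{ab}\,\frac{da}{a^{n + 3/2}} \\
&= \frac{2}{(n + 1/2)(n + 3/2)\,b^{n + 3/2}} - \frac{1}{b\,(n + 1/2)\,X_{\max}^{n + 1/2}} + \frac{1}{(n + 3/2)\,X_{\max}^{n + 3/2}}.
\end{align*}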
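The minimum is found by setting the first derivative to zero:

\begin{align*}
J'(\hat{b}) &= -\frac{2(n + 3/2)}{(n + 1/2)(n + 3/2)\,\hat{b}^{n + 5/2}} + \frac{1}{\hat{b}^2\,(n + 1/2)\,X_{\max}^{n + 1/2}} = 0 \\
\Rightarrow \quad \hat{b} &= 2^{\frac{1}{n + 1/2}} X_{\max}.
\end{align*}

To establish that $\hat{b}$ is in fact a minimum, note that

\[
J''(\hat{b}) = \frac{1}{\hat{b}^3\,X_{\max}^{n + 1/2}} > 0.
\]

Thus, the restricted Bayesian estimate is the uniform distribution over $[0, 2^{\frac{1}{n + 1/2}} X_{\max}]$.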
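As a sanity check on the closed-form estimate, the sketch below evaluates $J(b)$ in two ways and locates its minimizer numerically. It is an illustration under assumptions: the function names, the values $n = 3$ and $X_{\max} = 1.7$, the substitution $a = X_{\max}/u$ used for the quadrature, and the ternary search (which relies on $J$ being unimodal for $b \ge X_{\max}$) are all choices made here, not taken from the paper.

```python
def J_closed(b, n, x_max):
    """Closed form of the objective, valid for b >= x_max."""
    p = n + 1.5
    return (2.0 / ((p - 1.0) * p * b ** p)
            - 1.0 / (b * (p - 1.0) * x_max ** (p - 1.0))
            + 1.0 / (p * x_max ** p))

def J_quadrature(b, n, x_max, cells=100000):
    """Midpoint rule after substituting a = x_max / u, with u in (0, 1]."""
    p = n + 1.5
    h = 1.0 / cells
    total = 0.0
    for i in range(cells):
        u = (i + 0.5) * h
        total += abs(b * u - x_max) * u ** (p - 2.0)
    return total * h / (b * x_max ** p)

def argmin_ternary(func, lo, hi, iters=200):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

n, x_max = 3, 1.7  # arbitrary sample size and sample maximum
b_hat = 2.0 ** (1.0 / (n + 0.5)) * x_max
b_num = argmin_ternary(lambda b: J_closed(b, n, x_max), x_max, 10.0 * x_max)
quad_gap = abs(J_quadrature(2.5, n, x_max) - J_closed(2.5, n, x_max))
```

The numerical minimizer `b_num` agrees with `b_hat`, and the quadrature agrees with the closed form at the test point $b = 2.5$.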